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Abstract 


The  Parallel  Evaluation  and  Experimentation  Platfcxm  (PEEP)  is  the  result  of  an  efftxt 
at  Rome  Laboratory  to  idoitify  the  most  promising  software  development  tools,  techniques 
and  approaches  from  industry  and  academia  for  programming  high  performance  parallel 
cmnputers  to  meet  the  needs  of  Command  and  Control  (C^)  applications.  The  PEEP  is  a 
prototype  platfram  for  evaluating  the  applicability  of  results  fr^  parallel  programming 
research  efforts  to  improve  the  productivity  of  designers  and  devel(^)ers.  Intermetrics 
conducted  a  study  of  available  innovative  tools  and  techniques,  beginning  in  early  1990. 
From  the  survey,  we  chose  candidates  for  inclusion  on  a  prototype  platfcxm,  and  began  to 
install  and  evaluate  the  chosen  components.  With  die  prototype  PEEP,  we  conducted  a 
number  of  experiments  developing  small  parallel  programs  using  the  selected  tools.  The 
purpose  of  these  experiments  was  not  to  advance  die  state  of  the  art  in  parallel  algorithms, 
but  to  exercise  the  tools  collected  for  the  prototype  PEEP.  Based  on  this  work,  we  identified 
requirements  on  architectures,  life  cycle  activities  and  technologies  to  support  parallel 
develc^ment  and  developed  a  long  range  plan  for  the  PEEP.  Our  conclusions  from  these 
experiments  also  suggest  useful  methodologies  for  developing  parallel  software,  and  have 
led  to  recommendatitms  based  on  the  performance  of  the  current  tools  and  on  the  projected 
needs  of  parallel  software  development 
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1  Introduction 


1.1  The  Challenge 

The  technology  for  parallel  processing  is  evolving  rapidly.  New  hardware 
architectures  are  being  introduced  each  year  in  bodi  the  research  and  commercial  seaors. 
Following  these  developments,  and  sometimes  in  advance  of  them,  is  a  multitude  of 
approaches  to  the  development  of  applications  capable  of  exploiting  these  “high 
p^(»mance”  parallel  processing  architectures.  However,  thm  is  neither  consensus  on  the 
“best”  processors  nor  the  “best”  methods  for  developing  applications  to  exploit  these 
architectures — nor  is  there  likely  to  be.  RathCT,  a  number  of  Averse  architectures  will 
emerge  and  coexist,  as  will  a  multitude  of  methods  for  software  development  Therefore,  to 
exploit  the  potential  of  parallelism,  the  challenge  is  K>  provide  a  means  m  match  the  “best” 
architecture(s)  with  the  “best”  method(s)  for  a  given  applicatitxi  problem. 

While  high-performance,  parallel  architectures  have  received  much  attention  for 
Scientific  Coii:q)uting,  their  suitability  to  embedded  Comnoand  and  Control  (C?)  applications 
has  been  less  studied.  The  potential  in  is  somewhat  different  fiom  Scientific  Ccmiputing, 
as  a  result  of  the  different  characteristics  of  the  problem  space.  For  exanq)le,  many  scientific 
applications  consist  of  fairly  single  calculations  carried  out  on  a  massive  scale — ^for  such 
applications  a  Single  Instruction  Multiple  Data  model  is  apprr^matt.  However, 
applications  oftm  deal  with  large  numbers  of  distinct,  auttxiomous  entities,  each  with  its 
own  state  and  set  of  behaviors,  as  in  simulatitxi,  tracking,  or  Battle  Management  Because 
applications  are  often  embedded  systems,  much  of  their  code  must  deal  with  interactions 
with  other  systems  (e.g.,  timing,  event  handling),  in  contrast  to  scientific  applications  which 
are  purely  transformational  (input/butput-oriented).  Furthermore,  as  enabled  systems, 
these  applications  may  have  much  hitter  Reliatnlity,  MaintainaMlity  and  Availability 
requirements  than  laboituoty  appucations.  Ccxisequently,  the  ccxnplexity  of  tiiese 
applications  results  in  software  which  dwarfs  the  “codes”  developed  for  scientific 
confuting.  Unfortunately,  there  is  little  experience  in  parallel  programming-in-the-large — 
meeting  this  challenge  is  the  goal  of  Software  Engineering  for  Parallel  Architectures 
(SEPA)  project 

1.2  Background 

The  emerging  generation  of  parallel  processors  offer  unprecedented  q)portunities  for 
high  performance  computing,  in  terms  of  pushing  the  limits  of  what  may  be  computed,  and 
doing  so  witir  greater  accuracy,  efficiency,  timdiness,  and  reliability. 

Hardware  architectures  for  parallel  processing  may  be  usefully  characterized  as  falling 
into  three  classes:  Multiple  Instruction  Multiple  Data  (MIMD),  Single  Instruction  Multiple 
Data  (SIMD),  and  mixed  systons.  Within  the  class  of  MIMD  architectures,  the  principal 
parameter  of  variation  is  the  interconnect  strategy — the  topology  of  connections  among 
processors.  The  range  of  interconnection  strategies  varies  widely,  fiom  pair-wise 
connections  intended  to  make  arbitrary  communication  sufficiently  fast  for  typical 
computations,  to  specialized  busses,  grids,  and  hierarchical  structures,  usually  intended  to 
optimize  communications  under  some  set  of  assumptims.  MIMD  architectures  may  utilized 
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shared  memoiy  between  processors,  or  message  passing,  or  both.  This  mix  is  another 
important  parameter  of  MIMD  systems. 

There  ate  several  commercially  available  MIMD  processors  including  the  BBN 
Butterfly,  the  Sequent  machine,  “hyper  cubes”  (so  nam^  for  their  interconnection  scheme) 
from  a  number  of  vendors,  and  others.  MIMD  machines  are  typically  construaed  from 
“stock”  processors  such  as  the  MC680X0. 

In  die  class  of  SIMD  architectures  diere  are  fewer  commercially  available  machines. 
The  Connection  Machine,  is  perhtqis  the  best  known;  array  processors  and  vector 
processes  also  fall  into  this  class.  As  with  MIMD  architectures,  the  principal  parameter  of 
variadcm  is  the  interconnection  scheme. 

There  are  a  variety  of  mixed  architectures.  For  example,  Cray  computers  are  built  from 
one  to  four  Single  Instruction  Single  Data  (SISD)  processors,  integrated  with  vector 
processing  elements.  Signal  processing  conqiuters  are  also  mixed  architectures,  typically 
c(»itaining  sequential  processors,  application-specific  elements,  and  array  processors. 

Dataflow  machines  present  a  highly  parallel  ‘‘virtual  machine”  but  may  in  fact  be 
constructed  of  any  types  underlying  hardware. 

Neural  network  architectures  are  another  mixed  system — constructed  from  a  variety  of 
sequential,  array,  and  specialized  processors — but  giving  die  appearance  of  SIMD;  each 
neuronal  processing  element  has  its  own  data  but  foUows  a  sin^e  processing  rule:  to 
produce  its  output  by  firing  when  die  weighted  sum  of  its  inputs  exceeds  a  certain 
threshold. 

Of  course  to  achieve  die  goals  of  high  perfmmance  cm^iuting,  parallel  hardware  isn’t 
enough — ^we  need  software  capable  of  expldting  its  ,^abilities.  Thus,  to  exploit  parallel 
architectures  we  must  ask,  W^e  will  this  Software  come  from?  There  are  several 
alternatives:  (1)  existing,  sequential  software  may  be  reused  as-is;  (2)  sequential  software 
may  be  re-engineered  to  exploit  parallelism;  and  (3)  new  software  may  be  developed  which 
exploits  the  parallel  cs^abilities  of  hardware  through  its  application  of  new  (parallel) 
algorithms  and  data  structures.  Alternatives  (2)  and  (3)  require  development;  unfortunately, 
software  development  is  hard,  even  for  today’s  well-undnrstood  sequential  machines.  The 
discipline  of  Software  Engineering  has  emerged  over  the  past  30  years  out  of  the  field  of 
Computer  Science  to  confront  die  issues  of  develt^nng  large,  critical  software  applications. 
Software  Engmeering  contributes  models,  languages,  tactical  methods,  and  strategic 
processes  for  organizing  software  development 

Although  we  now  are  beginning  to  have  adequate  models  of  sequential  computing,  we 
lack  con^arably  powerful  formal  models  of  parallel  confutation.  Although  it  is  easy  to 
generalize  a  sequential  model  to  a  parallel  collection  of  sequential  "threads,”  there  are  two 
further  considerations:  coordination  between  threads,  and  the  issues  (tf  parallel  data.  That 
must  be  factored  into  a  process  for  engineering  parallel  software.  We  lack  adequate  theories 
in  these  areas  which  are  suitably  rigorous  and  sufficiently  expressive  to  be  generally  useful. 
This  seems  to  be  because  our  concepts  in  these  areas  are  still  highly  architecture-specific,  or 
so  low-level  as  to  be  ineffectual  for  large  systems  (in  the  same  way  that  attempting  to 


program  an  automated  teller  system  using  a  Turing  machine  formalism  would  be 
iiwffectual). 

As  we  attempt  to  raise  the  level  of  software  engineering  for  parallel  systems,  we  face 
issues  such  as: 

•  What  intact  do  parallel  architectures  have  on  the  design  process?  And,  what  tools 
can  help  to  manage  this  impact? 

•  How  does  one  understand  and  isolate  target  machine  dependencies? 

•  What  are  the  trade-offs  between  portabili^  and  efficiency? 

•  How  does  language  choice  influence  design? 

To  address  questions  such  as  these,  the  SEPA  program  built  a  prototype  framework 
within  which  to  evaluate  current  parallel  processing  technologies:  hardware,  software,  and 
methods.  This  framework  is  the  Partdlel  Evaluation  and  Experimentation  Platform  (PEEP). 

U  Overview  of  Approach 

In  this  section,  we  outline  our  approach  to  develqring  the  PEEP. 

Based  on  an  initial  understanding  of  the  problem  domain,  we  began  by  surveying 
currendy  available  capabilities  to  support  parallel  development  These  capab^ties  iiKlude: 
parallel  programming  languages,  parcel  programming  tools  and  environments,  software 
design  and  analysis  tools  and  mediods,  aral  the  underiying  hardware  needed  to  host  tiiese 
capabilities.  The  survey  led  to  an  identification  of  a  number  of  promising,  publicly  available 
tools  and  techniques  which  we  characterized  for  their  applicability  to  various  parallel 
programming  pi^lems.  The  resulting  preliminary  “tod^problem  matrix”,  documented  in 
the  survey  report  led  to  the  iiutial  population  of  the  PEEP.  We  have  subsequendy  refined 
the  initial  matrix  based  on  further  experimentation  and  analysis.  The  origiiud  and  revised 
results  are  discussed  in  section  2. 

The  most  visible  result  r^f  die  SEPA  program  is  of  course  the  PEEP  itself,  which  has 
been  defined,  configured,  cpoated  and  de^er^  within  die  course  of  this  effort  Section  3 
provides  an  overview  of  the  PEEP,  its  design  and  prototype  inqilementation,  and  a 
discussion  of  its  continued  evolution. 

The  PEEP  served  as  a  base  upon  which  to  evaluate  parallel  development  methods. 
Three  experiments  were  defined  a^  carried  out  reflecting  three  different  “problems”  in 
software  development  for  parallel  systems.  In  section  4,  we  describe  this  series  of 
experiments,  and  our  findings  with  regard  to  methodology  and  the  usability  of  tools  and 
techniques  in  supptvt  of  that  methodology. 

Section  5  summarizes  our  overall  conclusions  and  recommendations  fn*  the  future  in 
terms  of  requirements  for  the  software  engineering  of  parallel  systems,  and  a  SEPA  long¬ 
term  plan. 
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2  Survey  of  Parallel  Processing  Tools  and  Techniques 

2.1  Introduction 

As  a  part  of  the  SEPA  program,  we  surveyed  existing,  available  technology  for  parallel 
architectures.  We  began  with  a  literature  search,  using  the  facilities  of  the  MIT  libraries,  and 
a  number  of  computerized  searching  systems.  In  many  cases  we  followed  up  by  contacting 
the  researchers,  fiiKiing  references  to  current  and  upcoming  projects  as  well.  As  a 
foundation,  we  had  access  to  two  recent  surveys;  Survey  of  Parallel  Computing  [Miller, 

1989]  and  Software  Techniques  for  Non- Von  Neumann  Architectures  [Lightfoot,  et  al., 

1990] .  Our  work  was  meant  to  supplement,  not  replace,  these  existing  surveys.  The  focus  of 
this  survey  was  parallel  software  develq)n3ent  tools.  For  this  reason  operating  systems  such 
as  Mach  or  CrcMius  were  not  oxisidered  as  part  of  the  survey  sirtee  they  do  not  fit  in  the 
tools  category.  It  is  important  to  note  that  the  survey  was  not  exhaustive,  and  that  it  has  a 
bias  towa^  University  projects  rather  than  commercial  products.  Papers  in  the  literature 
tend  to  be  the  results  of  academic  research  rather  than  descriptions  of  commercial  tools. 

The  technologies  we  surveyed  fell  into  several  categories: 

•  programming  environments  and  t(X}ls  relevant  tt>  parallel  programming 

•  program  visualizaticxi  (including  program  animation  and  modeling) 

•  parallel  languages 

The  purpose  of  surveying  the  first  category  was  two-fold:  to  identify  promising 
approaches  to  integrated  framewoiks  (i.e.,  environments)  in  support  of  parallel  program 
develc^ment;  and  to  identify  promising  tcx>ls  supporting  some  aspect  of  parallel  program 
developnaent 

Under  this  category,  the  following  tools  and  environments  were  surveyed 
(descriptions  of  these,  and  all  other  t(x>ls  mentioned  in  diis  report  are  provided  in  appendix 
B): 
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The  following  systems  were  examined  as  visualization  systems: 

BALSA-n 

TANGO 


We  initially  looked  at  the  following  parallel  languages: 

Multilisp 

Occam 

Linda 

Molecule 

Crystal 

Paralation  model 

C* 

Id 

Unity 

UC 

Joyce 
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2.2  Tool/Problem  Matrix 

Based  on  the  infonnation  gathered  in  the  survey  we  initially  developed  a  tool/problem 
solution  matrix  covering  the  software  develc^ment  problems  of  parallel  software,  and  some 
additional  problems  relating  to  quality  and  management  concerns  of  large  software  systems. 

It  was  interesting  to  note  that  most  of  die  tools  surveyed  do  not  address  the  quality 
and  management  aspects  of  develq)ing  software.  The  primary  reastm  these  issues  are  not 
addressed  in  the  tools  surveyed  is  protnbly  that  these  tools  are  almost  entirely  university 
research  projects.  As  such,  they  are  not  concerned  with  risk  assessment  or  measuring 

*  Targets  repre^ts  known  implementations  of  the  languages  at  the  time  of  the  survey.  It  is  possiUe  that 
there  are  other  implementations,  either  not  known  at  the  time,  or  implemented  since  thm. 
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productivity,  or  with  the  engineering  of  truly  large  and  complex  systems.  Further,  for  many 
software  engineering  activities,  whether  the  problem  domain  or  target  hardware  is  inherently 
parallel  is  irrelevant  to  the  activity;  thus  technical  or  management  tools  for  parallel  software 
development  need  not  be  different  frcxn  those  for  ordinary  software  engineering. 

The  initial  matrix  was  split  into  3  tables.  The  first  mapped  tools  to  parallel  software 
engineering  problenss.  The  second  mapped  tools  to  quality  issues,  and  die  last  m^ped  tools 
to  management  issues. 


PARALLEL  SOFTWARE  ENGINEERING  PROBLEMS 


Parallel  software  quality  issues 


Parallel  Software  Management  problems 


■gjjggwgi 


2^  Discussion 


One  result  of  our  study  was  a  refonnuladon  of  the  matrix  develc^)ed  earlier  in  this 
project  We  expected  to  update  die  matrix  as  we  used  tools  to  reflect  addititxial  capabilities 
not  mentitmed  in  the  papers,  or  to  downgrade  ct^bilities  that  were  not  supported  to  the 
extent  we  expected,  llie  ori^nal  matrix  also  nested  revision  in  structure  for  three  main 
reasons.  First  some  of  the  "problems"  in  die  original  matrix  are  simply  phases  of  the 
software  develc^ment  cycle,  such  as  design,  and  are  not  specific  to  psoallel  software 
development  even  though  diere  may  be  some  unique  problems.  For  some  of  the  "problems" 
which  are  unique  to  parallel  software  development  it  seemed  important  to  identify  where  in 
the  development  process  these  problems  ne^  to  be  addressed.  Finally,  some  of  Ae 
"problems"  are  usually  determined  by  external  factors,  and  are  not  direcdy  addressable  by 
themselves.  We  also  classified  the  tools  in  terms  of  supporting  some  particular  technique, 
such  as  simulation,  that  is  used  to  help  die  de^loper  solve  the  problems.  The  new 
tooVproblem  solution  matrix  is  structured  in  terms  of  the  software  development  life  cycle, 
and  the  unique  problems  added  by  parallel  architectures  to  each  phase  of  the  process.  The 
following  sections  describe  the  new  matrix  structure,  and  present  an  updated  matrix  based 
on  the  tools  actually  evaluated. 

23.1  Revised  Matrix  Structure 

Of  the  ori^al  "problems".  Specification,  Design,  Code/Test,  and  Debug/Test  become 
life  cycle  activities.  In  the  new  matrix  they  toe  broken  down  into  Requirements  Analysis  and 
Definition,  High  Level  Design,  Low  Level  Dodgn,  In^lemmtation,  and  Testing  and 
Integration.  This  life-cycle  and  the  parallel  programming  "problems"  addressed  during  each 
phase  are  discussed  below. 

Requirements  Analysis  and  Definition  is  concerned  with  defining  the  application 
and  identifying  interfaces.  The  activities  associated  witii  this  phase  have  to  do  with  analyzing 
the  requirements  for  coaq)leteness,  consistency,  feasilnlity,  and  customer  acceptalnlity. 

These  activities  ate  not  necessarily  changed  if  the  azdiitecture  is  parallel  rather  than 
amventionaL  For  example,  user  interface  requirements  having  to  do  with  screen  layout 
would  not  influence  the  architecture  of  the  machine,  but  user  interfiice  requirements  having 
to  do  with  response  time  could  influence  the  architecture  if  it  became  clear  that  they  could 
only  be  met  with  a  parallel  implementation.  So  it  is  possible  that  all  requirements  analysis 
could  be  affected  by  having  a  parallel  architecture,  or  it  could  be  conq)letely  independent 
Ideally,  tins  phase  of  the  software  develc^mient  (xocess  would  be  completely  architecture 
independoit  unless  a  specific  architecture  was  listed  as  a  requirement 

Design  immediately  fdlows  requirements  analysis  and  definition,  and  in  sixne  cases 
the  division  can  be  blurred.  Consider  Ae  case  where  an  application  has  been  partitioned  into 
it's  major  processes  during  requirements  analysis  to  determine  feasibility.  Is  this  still 
requirements  analysis,  or  is  it  the  beginning  of  design?  Even  die  line  between  high  level  and 
low  level  design  can  be  Uurred.  For  the  purpose  of  this  study,  high  level  design  is  classified 
as  identifying  the  main  components  of  die  s^tware  and  the  interfaces  between  them,  while 
low  level  design  is  more  concerned  with  how  the  previously  identified  components  will  be 
implemented.  In  addition  to  the  normal  design  processes  Ol  decomposing  the  problem  and 
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defining  interfaces,  high  level  parallel  software  design  also  involves  Process  Partitioning, 
Data  Distributitm,  and  Data  Coordination.  These  are  defined  as  follows: 

•  Process  Partitioning  is  the  process  of  breaking  the  application  into  separate 

processes  that  could  be  p^ormed  in  parallel. 

•  Data  Distribution  refers  to  the  process  of  determining  how  data  should  be  divided 

among  processors.  Some  aspects  of  this  problem  vary  depending  on  the 
architecture,  and  may  be  influenced  by  the  process  partitioning.  For  example,  it  is 
more  imp(»tant  for  data  to  be  local  to  the  process  that  uses  it  in  a  message  passing 
MIMD  architecture  than  on  a  shared  memory  machine. 

•  £>ata  Coordination  is  die  process  of  identifying  the  communication  or 

synchronization  of  data  between  processes.  This  previously  came  urxler  the 
heading  of  CommunicatioiL 

During  low  level  design  the  high  level  design  problems  are  investigated  in  noore  detail, 
and  the  following  additional  problems  must  be  addmsed: 

•  Algorithm  Selection  may  be  determined  by  the  architecture  if  it  is  already  known,  or 

it  may  determine  the  t^ipropriate  architecture.  Ideally  the  algorithm  would  still  be 
architecture  independmt,  but  given  the  current  state  of  the  practice,  this  is  usually 
not  the  case. 

•  Data  Type  Determination  is  the  process  of  specifying  the  important  abstract  data 

fypes. 

•  Crntrol  SynchronizaticMi  refers  to  the  process  of  determining  necessary 

synchronization  between  processes.  Such  as  points  where  one  function  must 
complete  before  another  can  proceed.  This  also  came  under  die  Communication 
heac^g  in  the  old  matrix. 


It  should  also  be  noted  that  language  selection  must  be  done  during  the  design  phase. 
This  is  also  true  of  sequential  software  develt^iment,  but  for  parallel  software  development  it 
is  more  likely  to  be  determined  by  external  facmrs  such  as  availalnlity  on  the  target,  or 
availability  of  existing  source  co^.  One  problem  is  not  unique  to  pa^el  software  design, 
but  should  be  mentioned  because  none  of  the  parallel  design  tools  support  it  That  is  the 
production  of  review  materials.  Part  of  any  design  process  is  to  hold  a  design  review  where 
eiqierts  read  the  design  and  comment  on  it  It  is  necessary,  therefore,  to  produce  some  sort 
of  review  materials  for  die  reviewers  to  read.  This  is  lacking  in  all  of  the  tools  surveyed. 

Impleinaitatkm  is  probably  the  best  defined  portion  of  the  development  process.  In 
the  new  matrix  it  is  broken  down  into  coding,  linking/loading,  and  debugging. 

•  Coding  includes  both  writing  and  con^iling  the  tqrplication. 

•  Linking/Loaduig  refers  to  the  process  of  building  the  conqilete  qiplication  from  the 

conned  code  and  loading  it  on  the  parallel  target 

•  Debugging  covers  the  testing  of  inde^dent  pieces  of  code;  finding  and  fixing  bugs 

as  they  arise. 

In  general  purpose  computers,  linking/loading  is  usually  considered  to  be  the  last  step 
of  coding.  We  have  split  it  out  for  parallel  machines  because  the  linking  and  loading  of  a 
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paxaUd  application  usually  involves  some  mapping  of  processes  to  processor.  AU  of  the 
implementation  problems  spp\y  equally  to  sequential  programs,  however,  special  mols,  e.g. 
parallel  debuggers,  axe  need^  in  the  case  of  parallel  software. 

Testing  and  Litegratkm  is  the  part  of  die  process  where  all  the  pieces  of  the 
application  axe  brought  together  and  tested  as  a  whole  to  determine  whether  all  the 
requirements  are  met  The  testing  and  integration  problems  are  Validation  Testing,  and 
Perfomance  Measurement  and  Tuning. 

•  Validation  Testing  verifies  the  application  meets  requirements.  In  general,  validation 

testing  is  the  processor  of  measuring  the  reliability  and  ccnrecmess  of  the  software. 
Testing  parallel  software  involves  finding  additional  problems,  such  as  non- 
deterministic  bugs  or  deadlock  situatirxis. 

•  Performance  Measurement  and  Tuning  involves  finding  the  bottlenecks  in  the 

application  and  making  adjustments  to  die  application  to  alleviate  them  without 
creating  other  bottlenecks. 


The  remaining  problems  in  the  old  matrix.  Load  Balancing,  Computation  Replication, 
Number  of  Processors  and  Reuse  were  removed  for  the  following  reasons. 

Load  Balancing  divides  into  static  load  balancing  and  dynamic  load  balancing. 
Dynamic  load  balancing  has  to  be  implemented  at  the  operating  system  level.  It  is  not 
something  that  can  be  added  after  die  fact  by  a  mol,  and  it  is  not  addressed  directly  during 
the  development  process.  Static  load  balancing  is  addressed  during  the  development 
process,  but  is  covered  by  a  comMnation  of  die  other  development  problems:  performance 
evaluation,  process  partitioning,  and  data  distribution. 

Confutation  replication,  that  is  the  duplicatim  of  a  conf  utaticmal  unit  rai  mrae  than 
one  processor,  tends  m  be  language  dependent,  for  example,  via  task  types  in  Ada.  In 
languages  that  are  extensions  of  sequential  languages  it  is  sometimes  handled  by  an 
additional  tool  at  link  time  which  maps  processes  m  processors,  but  again  it  is  Iwguage 
dependent 

Number  of  processors  represented  the  problem  of  determining  the  ideal  number  of 
processors  for  a  given  application.  This  was  goierally  not  a  concern  during  the  development 
process.  Either  the  number  of  processors  was  a  known,  limited  number,  in  which  case  the 
challenge  was  m  design  die  most  efficient  algorithm  fnr  that  number,  or  die  number  of 
processors  was  variable,  and  the  solution  was  designed  in  terms  of  an  unlimited  number  of 
"virtual  processors"  which  could  then  be  mapped  to  die  actual  number  of  processors  when  it 
was  known.  The  latter  was  also  die  best  apfxoach  to  take  when  concerned  about  scalability 
as  it  is  easier  to  map  a  greater  number  of  virtual  processors  to  fewer  actual  processors,  than 
it  is  to  break  up  an  algorithm  that  was  designed  with  a  given  number  of  processors  in  mind. 

Reuse  could  be  classified  in  two  of  ways.  Hrst  is  the  ability  to  reuse  existing  code 
firagments,  "dusty  decks",  as  is,  in  a  parallel  framework.  This  is  usually  dependent  on  the 
language  and  whether  it  is  based  on  an  existing  sequential  language.  Second  is  the  ability  to 
write  code  that  can  be  reused  on  different  parallel  architectures.  This  is  more  of  a  quality 
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issue,  and  it  depends  on  whether  this  is  an  in^x»tant  ccmsideration  for  the  application  in 
question.  Quali^  issues  are  discussed  below  in  section  2.4. 

In  addition  to  identifying  the  parallel  software  development  problems  addressed  by  a 
particular  tool,  we  identify  whether  a  particular  problem  solving  technique  (such  as 
simulatitm)  is  eixq)loyed.  In  sonre  cases  a  tool  <toes  not  really  focus  c«i  any  one  particular 
problem,  but  supports  a  particular  technique  which  could  be  used  by  the  developer  to 
analyze  a  problem. 

2JJ  Revised  TooVProblem  Solution  Matrix 

The  revised  matrix  includes  only  tools  which  were  evaluated  in  noore  dq}th  during  this 
project  This  includes  tools  which  were  used  during  the  development  of  the  sample 
problems,  or  tools  for  which  we  had  the  user  documentation  in  addition  to  the  published 
paper. 


In  the  matrix  indicates  a  problem  that  is  intended  to  be  solved,  't'  indicates  a 
problem  which  the  tools  does  not  claim  to  solve,  but  to  which  we  tried  to  apply  it  because 
there  seemed  to  be  a  possibility  the  tool  would  help. 

Note:  Languages  are  now  included  (xi  the  list  to  the  extent  that  a  particular  compiler  or 
interpreter  was  available  on  the  PEEP  prototype.  See  section  4  for  more  detailed  evaluations 
of  the  tools. 

2.4  Further  Development 

Based  on  the  updated  Tool/Problem  Soluticxi  Matrix  it  appears  that  future  matrices 
should  also  try  to  capture  quality  issues.  That  is,  how  the  tool  or  technique  improves  the 
quality  of  the  resulting  application.  To  some  extent  the  quality  of  the  application  depends  on 
the  requirements.  For  example,  architecture  portability  may  not  be  an  imprxtant 
consideradon  is  the  rqiplicadon  is  requited  to  run  on  mly  a  single  target  architecture.  In  this 
case  a  non-portable  implementation  is  not  necessarily  a  low  quality  result 

Therefore  further  matrices  should  try  to  capture  the  support  of  the  following  system 
"iUtics": 

Maintoinabiiity 

EvdvabiliQr 

Scalability 

Architecture  Portability 

Reusabiliiy 

Productivity 

Note,  efficiency  is  not  on  this  list  as  presumably  the  main  reason  for  using  a  parallel 
architecture  is  to  improve  efficiency  or  throughput  As  a  result  all  tools  that  aid  in  parallel 
software  development  could  be  said  to  improve  efficiency. 
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3  PEEP  Configuration 

In  diis  section  we  provide  an  overview  of  the  PEEP  configuration  and  the  criteria  that 
led  to  tills  configuratirm. 

3.1  Architecture 

The  PEEP  is  designed  to  be  an  evolvable  configuration  of  hardware  and  software, 
such  that  it  is  both  immediately  usable,  relatmly  robust,  and  yet  open.  The  hardware  base 
which  has  been  selected  consists  of  well-established,  off-the-shelf,  mature  “stock” 
hardware. 

Sun  workstations  form  the  main  platform  of  the  PEEP.  Two  Suns  were  procured  for 
the  project,  a  Sun4/330  and  a  Sun4/470.  These  workstatiiMis  include  color  monitors  and  a 
graphics  accelerator  in  txder  to  provide  full  color  grE^tiiics.  The  intent  of  selecting  these 
machines  is  to  support  the  bulk  of  software  development  In  addition  to  the  Suns,  the  PEEP 
will  incorporate  an  Apple  Macintosh  ccanputer,  to  utilize  that  system’s  sophisticated  user 
interface,  while  integrating  it  with  the  Sun  “backbone”  which  act  as  file  server.  At 
Rome  Laboratory,  the  Sun4/470  will  be  a  back-end  server  to  a  network  of  Apple 
Macintoshes  connected  via  Ethernet  The  Macintoshes  serve  as  multi-user  fmt  ends  for  the 
PEEP  and  also  be  used  to  perfnm  data  analysis  research  results  and  maintain  the  PEEP 
documentation.  The  architecture  is  shown  in  the  figure  below. 

The  Meiko  In-Sun  Computing  Surface  was  installed  in  both  Sun  workstations.  It 
provides  a  scalable,  multi-processor  distributed  memory  architecture.  Each  individual 
processor  executes  sequential  code  using  its  own  dedicated  memory.  Processors 
communicate  via  efficient,  high  speed  inter-process  message  links.  Each  computing  surface 
contains  16  processors  arxl  is  configurable  in  a  number  of  different  ways.  This  flexibility  in 
the  configuration  will  allow  the  Suns  to  be  used  to  help  determine  the  most  efficient 
processor  ccxifiguration  for  a  given  parallel  application.  Computing  surfaces  may  be 
combined  using  the  expansion  slots  on  the  Suns,  so  noore  boards  may  be  easily  added  in  the 
future  to  further  expand  the  parallel  capabilities  of  the  Suns. 
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3.2  Design 

As  the  tools  being  evaluated  on  die  PEEP  will  change  over  time,  it  is  important  to  have 
a  solid  foundation  on  which  to  build.  A  primary  goal  of  the  PEEP  design  is  openness.  The 
PEEP  should  be  open  to  hosting  new  tools,  languages,  and  methods.  Also,  given  its  primary 
role  as  a  platform  for  experimentation  with  software  tools  from  so  many  diverse  sources,  it 
must  be  easy  to  host  as  many  of  these  as  possible.  For  this  reason.  Suns,  running  UNIX 
and  X,  and  Macintoshes,  ruruiing  X,  were  an  ideal  choice  for  hosting  die  diverse  collection 
of  university  and  commercial  tools  idoitified  by  our  earlier  study. 

There  will  be  a  core  set  of  tools  on  the  PEEP  uptm  which  the  test  of  the  syston  will  be 
built  Sun  workstations  are  equipped  with  a  number  of  tools  that  naturally  become  a  pan  of 
the  core  PEEP.  The  Suns  tun  the  UNIX  operating  system,  which  has  rapidly  become 
accepted  as  a  de  facto  standard  operating  system.  Odier  tools  which  are  packaged  with  the 
workstation  include  sequential  language  compilers  and  debuggers,  a  program  building 
facility  (make),  source  configuration  management  tools,  and  a  numbCT  of  editors  and  other 
text  processing  tools. 

Early  in  the  design  process,  we  considered  the  possibility  of  recommending  a  multi¬ 
processing  operating  s}rstem  like  Cronus  or  Mach.  In  the  long-run,  diis  may  be  an 
evolutionary  goal,  however,  at  present,  most  of  the  tools  with  which  we  have  peculated  the 
PEEP  run  under  UNIX  (Sun  OS),  and  it  remains  the  best  choice  for  this  time. 

Layered  on  top  of  Sun  OS  is  Sun’s  Open  Windows  graphical  user  environment.  This 
is  based  on  the  X  Window  System  from  MTT.  Tools  which  require  basic  X  facilities  run 
without  ixxxlirication  under  Open  Wuidows,  as  will  tools  developed  specifically  to  exploit 
Open  Windows. 
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Open  Windows  provides  the  most  basic  level  of  tool  integration  from  the  user’s 
viewpoint  Tools  resident  on  the  PEEP  are  accessible  to  users  via  a  special  PEEP  menu  that 
has  bem  added  to  the  default  workspace  menu  provided  by  Open  Windows. 

Further  integration  is  provided  by  FIELD  from  Brown  University.  FIELD  was 
developed  as  a  framework  for  integrating  software  developnoent  tools.  Using  FIELD'S 
message  passing  system,  indqrendent  parallel  software  develc^ment  tools  could  share 
information.  Another  tool  from  Brown  University,  GARDEN,  provides  a  noechanism  for 
supporting  multiple  parallel  languages  for  ptomty;wg  and  language  evaluation. 

3,3  Prototype  Implementation 

To  implement  die  PEEP  prototype  the  various  hardware  components  described  in 
section  3.1  were  purchased.  The  main  task  then  was  to  populate  the  PEEP  with  promising 
parallel  tools  and  languages  identified  in  the  survey.  Tl^  was  accomplished  in  stages  and 
tool  acquisition  ctmtinued  throughout  the  project 

Sun  workstations  were  chosen  as  the  platform  because  many  of  the  surveyed  tools 
would  run  (xi  them,  however,  there  were  some  tools  diat  were  not  inqilemented  for  the  Sun. 
These  tools  were  immediately  excluded  from  the  list  of  uxds  to  be  considered  for  the  PEEP 
as  porting  any  tool  to  a  new  tar^t  was  conadered  outside  the  scope  of  this  project 

Of  the  remaining  tools  we  focused  on  the  ones  that  were  the  noost  language  or  target 
independent  or  that  at  least  supported  multiple  targets  or  languages.  Acquisition  of  a  tool 
usually  involved  the  following  steps: 

1 .  Contact  die  main  developer  to  determine  current  status  of  the  project 

2.  Determine  whether  Sun  workstation  inqilemoitation  of  die  tool  is  availaUe. 

3.  Get  pricing  information  and  licensing  agreement 

4.  Negotiate  license  agreement 

5.  Complete  purchase  and  receive  delivery. 

In  some  cases  the  software  was  freely  available  and  accessiUe  via  the  internet  Step  3 
in  that  case  becomes  a  simple  case  of  dowtiloading  the  software  over  the  internet  and  steps 
4  and  5  are  not  necessary.  The  results  of  our  attempts  to  get  each  tool  are  documented  in  the 
table  below. 


Tool 

Contact 

Install 

Problems 

Comments 

"'Lisp  Sim 

Thinking 

Machines 

Yes 

♦Lisp  Subset 

BALSA  n 

Mark  H.  Brown 

Yes 

Macintosh  Host  only 

Demo  disk  installed 

CODE/ROPE 

James  Browne 

Yes 

Todc  over  6  months 
to  get  software 

New  version,  CODE 
2.0,  will  provide  better 
support 

Crystal 

Marina  Chen 

Yes 

Ccm^iler  for  iPSC/2 
target  never  available 

Interpreter  for  Sun 
host  used 
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Tool 

Contact 

Install 

CSTools 

Meiko 

Yes 

DINO 

Roberts. 

Schnabel 

No 

EXPRESS 

Parasoft 

No 

Faust 

Hammerslag 

No 

FORCE 

Harry  F.  Jordan 

No 

Hypertasking 

Marc  Baker 

No 

Id 

Arvind 

Yes 

IPS-2 

Barton  Miller 

No 

ISSOS 

Karsten/Schwan 

No 

Linda  (Q 

David  Gelertner 
Scientific 
Confuting 
Associates 

No 

Molecule 

Kai  Hwang 

No 

Olympus 

Gary  Nutt 

No 

PADWB 

Gould 

No 

PARET 

Nichols/Edmailc 

No 

PAT 

Kevin  Smith 

No 

PAWS 

Dan  Pease 

Yes 

PCN 

Yes 

PIE 

2:aty  Segall 

No 

Problems 

Comments 

Compiler  and  tools 
supplied  with  Meiko 
transputer  board 

Unable  to  contact 
develq)er 

Gxnmercial  product 
similar  to  CSTools 

Tools  no  longer 
supported 

Did  receive  tape,  but 
unable  to  inst^ 
software 

Product  not 
suppmted 

Unsupported  Intel 
serftw^ 

iPSC/2  host  only 

Took  over  2  noonths 
to  negotiate  license 

Initial  version 
required  Common 
Li^  to  be  installed 

VAX  hosted  only 

Started  a  Sun  port, 
but  never  finished 

Unable  to  contact 
devel(^)ers 

Project  ended 

Research  became 
commercial  product 

Unable  to  contact 
developer 

Incompatible  with 
Xwindows 

Received  license 
agreement 

Unable  to  contact 
developers 

Appears  published 
pi^  may  have  been 
proposal  rather  than 
results 

Unable  to  contact 
developers 

FORTRAN  specific 

Freely  available  via 
ftp 

Not  available  until 
end  of  project 

Only  beta  version 

Not  available  until 
end  of  project 

Available  via  ftp  over 
internet 

Unable  to  contact 
developer 

-17- 


Tool 

Contact 

Install 

Problems 

Comments 

PISCES 

Terry  Pratt 

No 

FLEX/32  host  only 

Project  on  hold  not 
likely  to  restart 

POKER 

Larry  Snyder 

Yes 

Toc^  over  6  months 
to  negotiate  license 
agreement 

Sun3  version  only 

POSYBL 

No 

Requited  network  of 
Sun  workstations 

Free  implementation 
of  Linda  for  network 
of  workstations 
available  via  fQ) 

PProto 

ISSI 

No 

Required  ONTOS 
DBMS 

PProto  available  from 
Rome  Laboratory 

Prometheus 

Sean  Arthur 

No 

Project  abandoned 

New  project 
TaskMaster 

STRAND-88 

Tim  Mattson 

No 

Commercial  parallel 
language 

TANGO 

Steve  Reiss 

Yes 

Availatde  via  ftp 

TaskMaster 

James  D.  Arthur 

No 

VAX  hosted  only 

TOPSYS 

Thomas 

Bemmeri 

No 

iPSC/2  host  specific 

A  number  of  problems  arose  during  this  process: 

•  Many  tools  were  lost  because  the  developer  had  left  the  company  or  graduated.  In  at 

least  one  case  the  current  employees  had  never  heard  of  the  tool.  In  another  case  the 
tool  was  available,  but  without  documentatitxi,  aixl  in  an  incomplete  state. 

•  Universities  ate  often  not  equipped  to  distribute  the  results  of  their  research  outside 

the  academic  coomiunity.  The  best  exanq>le  of  this  is  the  licensing  agreements 
supplied  by  the  Universities.  It  was  usuaUy  necessary  to  make  substantial  changes 
to  the  agreement  to  make  it  acceptable  for  a  commercial  company,  and  in  particular 
to  make  it  available  to  die  government  It  often  took  months  to  negotiate  the  license 
agreements. 

•  The  tool  might  not  exist  because  the  published  paper  rqnesented  future  plans  that 

were  not  irrqilemented  due  to  never  receiving  the  funding  support 

•  Software  support  varies  a  great  deal.  Some  ongoing  projects  are  only  too  happy  to 

provide  support  when  they  leam  that  a  third  party  is  using  their  mol.  In  other  cases 
the  software  is  delivered  "as  is"  with  no  documentation,  no  installation  instructions, 
incomplete  sources  and  no  support 

•  In  rare  cases  we  simply  got  no  response  at  all,  even  after  several  phone  calls  or  ft- 

mail  messages. 


33.1  Integration 

There  were  also  problems  with  the  planned  integration  approach  of  using  FIELD  and 
GARDEN. 
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First,  the  tools  we  acquired  were  too  independent  or  self-contained  to  be  integrated 
using  FIELD.  Often  the  tools  were  based  on  completely  different  assuiiq)ticHis  making  the 
sharing  of  information  meaningless.  Also,  the  type  of  integration  that  would  have  been  of 
most  use  was  "sequentiaL"  That  is,  the  output  of  one  tool  could  have  been  useful  as  the 
input  to  another.  FIELD  is  designed  for  more  '’parallel”  integration,  where  two  tools  are 
running  at  die  same  time  and  communicate  when  changes  are  made.  For  exan^ile,  imagine  a 
syntax  directed  editor  for  a  parallel  language,  and  a  tool  which  does  some  sort  of  static 
analysis.  If  the  editor  broadcasts  when  a  change  is  made,  the  analyzer  can  update  its  analysis 
based  on  die  change. 

Second,  GARDEN  was  not  robust  enough  for  heavy  use,  and  restrictions  on  input  and 
ouqiut  of  the  source  form  of  a  program  made  it  difficult  to  use  in  the  way  we  had  intended. 

It  is  also  the  case  that  research  on  GARDEN  has  stopped  for  die  moment  while  the 
developers  cmcentrate  on  FIELD.  As  a  result  it  is  not  likely  to  in^nove  in  the  near  term 

However,  we  did  integrate  one  toed  using  die  FIELD  mechanism.  TANGO  is  an 
algcaithm  animatiexi  tool  that  was  also  devele^ied  at  Brown  University.  It  was  already 
integrated  with  FIELD  in  its  sequential  form.  In  order  to  use  TANGO  with  the  transputer 
we  ported  the  target  specific  portions  of  TANGO  to  the  transputer.  Once  that  was  con^lete 
it  was  possible  to  animate  a  program  running  on  die  transputer  with  the  display  on  the  Sun 
workstation.  For  further  evaluation  of  TANGO  see  sectim  4. 

3.4  Evolution 

Specific  recommendations  for  the  evolution  of  the  PEEP  are  given  in  section  5. 

Above  die  level  of  individual  tools,  it  would  be  desirable  that  the  degree  of  integration 
evolve  toward  greater  tool  integration  and  interoperability.  This  could  be  achieved  via: 

•  a  common  intermediate  language  at  the  virtual  machine  interface 

•  common  program  database  facilities 

•  cornmon  iriformation  capture  capabilities 

FIELD  could  still  be  used  as  the  integration  mechanism  in  some  cases,  but  a  common 
data  fcamat  is  needed  to  provide  the  "sequential"  integration. 
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4  Experimentation  and  Evaluation 

We  ccRiduc^  a  series  of  experiments  in  parallel  development  The  purpose  of  these 
experiments  was  two-fold:  (1)  to  evaluate  the  functitmality  and  usability  of  the  tools  and 
techniques  which  had  been  selected  for  the  PEEP,  and  (2)  to  investigate  the  larger  issues  of 
“parallel-piDgramming-in-the-large”:  develq)ment  methodology,  tool  and  technique 
oxnbinations,  and  life  cycle  concerns. 

Based  on  the  survey,  we  determined  that  not  all  life  cycle  areas  were  equally  well 
supported.  Current  tools  supporting  such  activities  as  requirements  analysis  and  functional 
testing,  for  exan^le,  are  somewhat  limited  in  scope.  In  many  cases,  this  reflects  the  fact  that 
there  has  been  limited  success  in  these  activities  for  all  kinds  of  software  development — not 
just  f(x:  parallel  architectures.  Recognizing  the  practical  limitations  of  available  resources,  we 
chose  to  ctxicentrate  our  experimentation  on  those  areas  where  there  is  die  most  activity  and 
potentially  the  most  leverage  to  be  gained  by  applying  the  best  available  technology.  Active 
areas  include  parallel  languages,  parallel  develqmient  environments,  and  overall  analysis 
techniques  early  system  level  performance  simulation  to  monitoied  execution. 

This  chapter  is  orgaiuzed  to  hi^ilight  titese  methodological  and  life  cycle  issues.  In 
the  next  section,  we  describe  the  three  parallel  programming  problems  we  investigated. 
Subsequent  sections  address  the  full  sequeitee  of  Requirements,  Design,  G)ding,  etc., 
following  the  SEPA  Problem  Space  matrix  we  described  in  Chiq)ter  2. 

4.1  Problem  Characteristics 

We  identified  three  parallel  programming  problems  for  experimentation  on  the  PEEP. 
Each  was  selected  to  investigate  certain  issues  in  parallel  programming.  Our  emphasis  was 
not  on  creating  new  parallel  algorithms — there  is  already  plenty  of  work  on  this  in  the 
literature.  Rather,  our  en[q)hasis  was  on  the  engineering  of  relatively  well-understood 
algorithms  within  a  parallel  programming  context  This  allowed  us  to  focus  on 
methodology,  Ufe  cycle  concerns,  and  tool/technique  usability. 

The  first  problem  diosen  for  study  was  parallel  sorting.  Scnrting  is  a  fundamental 
computing  problem  invariably  present  in  any  large  C^,  or  information  systems  application. 
In  a  previous  smdy  done  for  Rome  Laboratory,  Coiiq}uter  Sciences  Corporation  identified 
such  C^  functions  as  pattern  analysis  and  database  maintenance  where  sorting  is  a  potential 
component  [Lightfoot  era/.,  1990].  Since  sotting  in  the  sequential  case  is  so  weU-studied, 
we  were  interested  in  moving  existing,  sequential,  algorithms  to  a  parallel  architecture  such 
that  any  inherent  parallelism  could  be  exploited.  The  "porting”  of  sequential  algorithms  to 
parallel  architectures  is  a  form  of  parallelization  which  has  been ,  and  is  likely  to  continue  to 
be  quite  common. 

The  second  problem  chosen  was  parallel  searching.  Searching  arises  in  many  database 
retrieval  and  real-time  pattern  analysis  functions.  It  is  critical  in  large  databases,  in  target  and 
threat  analysis,  and  in  weapons  management  and  assessmoit  activities.  [Lightfoot,  et  al., 
1990]  document  its  use  in  more  than  half  of  C^  functions.  In  a  realistic  environment, 
searching  must  take  place  where  there  are  possibly  multiple  readers  while  a  writer  is 
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updating  the  search  space  with  new  data.  We  chose  this  problem  to  investigate  “cooperative 
concunency”  wherein  there  is  a  degree  of  explicit  cooidinadon  —  a  style  quite  different 
from  that  implied  by  the  inherent  parallelism  of  the  previous  problem.  Furthermoe,  the 
problem  may  be  tackled  at  various  granularities  from  single  element  locking  to  large-grain 
tasking. 

Image  warping  enables  two-dimensional  data  to  be  presented  from  a  three- 
dimensional  perspective  determined  by  a  possibly  varying  angle  of  visibility.  It  is  useful  in 
analyzing  tt^graphical  data  and  managing  dtt  display  of  terrain  data  in  modem  C^I 
battlefield  support;  it  could  also  be  en:q}loyed  for  rustic  simulations.  Image  warping  is  an 
example  of  a  more  compute-intoisive  applications  than  sorting  and  searching,  and  involves 
enough  numeric  and  graphics  issues  to  raise  an  orthogcxial  set  of  parallel  programming 
issues.  Image  warping  is  naturally  modeled  on  a  SIMD  machine — ^we  in^lemented  it  on  the 
Transputer,  a  MD^  architecture.  Because  SIMD  architectures  are  highly  machine  specific, 
we  were  interested  in  seeing  whether  we  could  achieve  a  “target-independent”  foimuladon 
of  the  problem.  This  expeiiment  also  enabled  us  to  invesdgate  communicadon  and  data 
granul^ty  trade-offs.  Lasdy,  the  system  was  connected  to  an  X  display,  allowing  us  to 
consider  die  requirements  for  synchronizadon  necessary  to  yield  a  coherent  image. 

Appendix  A  provides  addidonal  desoipdons  of  the  three  experiments. 

4.2  Approach 

Our  overall  approach  was  to  carry  out  our  expoimental  development  acdvides  on  the 
prototype  PEEP— ^monstradon  of  the  efficacy  of  coordinated  use  of  tools  on  a  powerful 
woikstadcm  with  access  to  various  parallel  hardware  architectures  was  a  primary  goal  of  the 
prototype  PEEP.  Tools  were  select^  to  each  cover  a  relatively  qiecific  part  of  the  life  cycle, 
or  suppon  function.  Since  this  selection  did  not  address  the  question  of  commoi 
functionaliQr,  a  remaining  area  of  consideratitm  was  common  services  that  can  be  shared  to 
enhance  overall  platform  effectiveness. 

Initial  versions  of  all  experiments  were  programmed  in  C  and  targeted  to  a  16  node  in- 
SUN  Transputer  board  attached  to  the  PEEP  SPARCStation.  The  Meiko  CSTools  were 
used  for  the  first  phases  of  work  on  all  the  problems.  We  were  also  able  to  target  one  of  the 
problems  ro  the  Intel  iPSC/2,  but  were  not  able  to  get  any  timing  results  due  to  space  and 
memory  limitati(»s. 

The  tools  acquired  for  the  PEEP  to  be  evaluated  as  a  part  of  our  experiments  are  listed 
below.  Summaries  of  diese  and  all  the  tools  mentioned  in  this  report  are  provided  in 
Appendix  B.  The  tools  evaluated  are  divided  into  two  groups:  tools  which  we  attempted  to 
use  for  the  experimoits;  and  tools  which  we  installed,  or  ttied  to  install,  but  were  not  able  to 
use  for  the  expoiments  for  various  reasons. 

Used  for  experiments: 

CODE-  parallel  CASE  tool  (Uruversity  of  Texas,  Austin) 

CSN.niustrate  -  visualizing  Transputer  configurations  (Intermetrics) 

CSTools  -  C  and  FORTRAN  confer  system  for  transputer  (Meiko) 

Id  -  dataflow  language  (MTT) 


♦Lisp  -  SIMD  version  of  Lisp  (Thinking  Machines) 

-  parallel  requirements  tool  (RL/ISSD 

Not  used  for  experiments: 

Crystal  -  language  with  spedai  index  set  notation  (Yale) 

PAWS  -  performance  assessment  tool  (RlVSyracuse  University) 

PCN  -  hierarchical  graphical  language  (Argonne  Labs) 

POKER  -  parallel  environment  (University  of  Washington) 

TANGO  -  support  of  algorithm  animation  (Brown  University) 

Recommendations  regarding  these  ttx}ls  are  included  in  section  5. 

4^  Requirements  Analysis  and  Definition 

Within  Requirement  Analysis  and  Definition,  one  is  concerned  with  defining  the 
problem  to  be  solved,  and  the  constraints  on  tiw  solution.  The  result  of  this  activity  is  a 
requirements  defirtition  which  may  be  analyzed  for  conq)leteness,  consistency,  realizability 
and  customer  acceptability.  The  requirements  (tefinition  determines  the  functiraiality  of  the 
system  to  be  built  Insofar  as  possible,  the  requirements  specification  should  be  as  free  fitxn 
design  and  implementaticm  considerations  as  possible.  So,  in  one  sense,  the  issue  of 
whether  a  problem  is  to  be  solved  using  a  parallel  architecture  is  irrelevant  in  Requirements. 
This  argues  against  tire  need  for  particular,  parallel  architecture-orioited  requirements  tools. 
While  the  state  of  the  practice  in  requirements  tools  and  techniques  is  somewhat  immature, 
ritost  decent  requirements  techniques  do  not  presume  or  enforce  a  sequential  model  of 
conq)utation  (c,g.,  SADT,  IDEF,  SREM). 

Due  to  the  limited  scope  ci  our  investigations — three  problems  relatively  well-defined 
by  existing  algoritiuns — we  did  not  undertake  anything  which  approached  a  formal 
requirements  activity  for  die  three  problems. 

However,  we  did  use  Id  as  a  hi^-leveL  architecture-independent  executable 
specification  language.  In  the  experiments  reported,  these  spet^cations  were  only  at  the 
algorithrmc  level,  but  the  same  style  of  specification  could  be  used  nooie  abstractly  in  the 
specification  of  a  larger  system.  While  we  found  tite  dataflow  model  embodied  by  Id  to  be  a 
very  architecture-independent  form  of  expression,  dicre  are  some  drawbacks  to  the  dataflow 
model  which  could  be  a  problem  for  large  systems.  In  particular,  the  lack  of  a  notion  of  sute 
requires  operations  to  be  parameterized  by  each  object  they  operate  on  (see  Id  specification 
of  Search  in  A.2).  This  defeats  uiformation  hiding — readability — in  requirements 
specifications,  and  (as  discussed  below)  introduces  communication  problems  in 
implementations.  W’s  dataflow  model  also  does  not  permit  the  easy  expression  of 
communication  requirements. 

We  had  expected  that  *Lisp  would  play  a  similar  high-level,  specification  role,  but  we 
had  limited  success  using  the  ♦lisp  simulator  provided  by  Thinking  Machines  as  only  a 
subset  of  ♦Lisp  is  supported.  The  ♦Lisp  simulator  is  intended  to  be  a  learning  tool,  not  as  a 
serious  development  tool. 
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PProto  is  intended  to  support  requirements  analysis,  but  it  was  not  used  for  the 
experiments  as  it  was  not  available  at  tte  time. 

4.4  Design 

Design  is  generally  bndcen  into  High-Level  and  Detailed  (or,  Low-Level)  E>esign.  In 
Ifigh-Level  Design,  the  problems  penain  to: 

•  Process  Partitioning 

•  Data  Distributicm 

•  Data  (Coordination 

In  Detailed  Design,  tiie  issues  are: 

•  Algorithm  Selectitm 

•  Data  Type  Determination 

•  (Control  Synchronization 

The  tools  that  were  used  during  design  are  Id  and  PProto. 

Our  first  experiment  did  not  use  any  parallel  design  tools,  as  none  were  available  on 
the  system  at  the  time.  Instead  we  used  what  turned  out  to  be  a  trial  and  errcr  approach.  The 
initial  design  for  the  stnting  algorithm  was  m  divide  the  data  into  as  many  chun^  as  there 
were  processors  available  on  the  tranr  j)uter,  sort  each  chunk  using  a  fast  sequential 
algorithm  on  each  transputer,  and  merge  the  sorted  chunks.  The  communicaticm  between 
processors  is  shown  in  the  diagram  below. 


The  rationale  behind  such  a  design  was  that  this  would  have  the  maximum  number  of 
processors  working  on  the  problem  at  the  same  time.  The  results  of  this  sorting  algorithm 
were  quite  poor  as  die  partitioning  of  the  data  into  chunks,  and  passing  it  out  to  the 
indivi^al  processors  created  commutucation  bottlenecks. 

We  later  turned  to  a  dataflow  approach  to  design.  This  was  supported  by  the  Id 
language  and  analysis  tools.  This  involved  demonstrating  in  Id  (see  A.  1)  that  the  quicksort 
algorithm  itself  had  some  inherent  parallelism.  We  wrote  the  quicksort  algorithm  in  Id,  and 
ran  the  Id  interpreter  displaying  a  graph  of  parallel  arithmetic  operations.  It  was  difficult  to 
interpret  the  graph  as  there  was  no  way  to  tell  what  was  going  cm  at  any  given  point  in  time. 
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but  it  did  seem  to  indicate  that  the  quicksort  algoridun  itself  could  be  exploited  for  some 
paTallelistn.  This  was  implemented  on  the  transputer,  and  the  results  were  substantially 
better  than  the  first  sort  implementation.  The  basic  algorithm  was  to  take  the  data  and  divide 
it  into  two  lists,  one  greater  dian,  one  less  than,  a  given  component  The  two  lists  are  passed 
to  two  processors,  while  die  first  processor  waits  for  their  results.  Their  results  are  two 
sorted  Usts  which  can  singly  be  concatenated  to  form  the  final  list  This  algorithm  is 
repeated  at  each  node  until  thoe  are  no  more  nodes,  at  which  time  the  remaining  elements 
are  sorted  using  a  sequential  sort  This  approach,  shown  in  the  diagram  below,  was  not 
considered  when  designing  the  quicksort  by  hand  because  it  appears  to  make  poor  use  of 
the  available  processors. 


At  the  point  where  the  sequential  sort  takes  place  (the  botlow  row  of  processors)  half  of  the 
processors  are  simply  waiting  for  results,  ai^  only  half  of  die  processors  are  working  on 
the  problem.  However,  the  use  of  Id  as  a  design  mol  led  us  to  diis  counter-intuitive  solution 
which  turned  out  to  be  quite  effective  as  the  results  shown  below  denxxistrate. 
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We  also  tried  to  use  the  Id  dataflow  design  approach  to  die  search  experiment  We 
were  able  to  try  out  a  number  of  approaches.  At  diis  time  we  received  a  new  version  of  Id 
that  allowed  the  performance  graphs  to  use  color  in  various  ways.  We  were  able  to  assign  a 
color  to  each  Id  fiinctitxi  in  our  program,  and  when  the  arithmetic  qieratiais  graph  was 
displayed,  the  graph  was  colored  according  to  how  may  qiaations  a  given  function  was 
running  in  that  cycle.  This  color  coding  of  the  graph  was  very  helpful,  and  also  served  to 
show  us  that  often  our  interpretation  of  the  graphs  had  been  wrong.  In  die  black  and  white 
graphs  we  would  tty  to  associate  a  peak  in  the  graph  with  something  we  knew  was 
happening  in  the  algorithm.  When  we  used  the  color  graphs  we  would  discover  that  a 
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function  quite  different  from  the  one  we  were  expecting  was  contributing  to  the  peak  in  the 
graph.  Thus,  die  color  version  of  Id  is  much  more  useful  than  die  older  version. 

A  drawback  of  the  Id  design  ^preach  with  respect  to  the  second  experiment  was  the 
fact  that  as  a  very  fine  grained  approach  to  parallelism,  it  did  not  leaUy  help  with  our  goal  of 
investigating  synchronization  issues.  Id  performs  its  locking  at  a  very  low  level,  so  the  type 
of  explicit  synchronization  that  one  might  need  in  an  Ada  implementation,  for  example,  are 
missing  in  the  Id  implementation.  Note  that  this  does  not  necessarily  mean  that  an  Id 
specification  is  not  a  good  way  to  start,  but  just  diat  it  did  not  support  our  goal  of 
investigating  synchronization  issues.  In  general.  Id  provided  a  good  way  to  investigate 
algorithms,  but  it  is  not  designed  to  provide  any  kind  of  feel  for  what  the  ccnnmunication 
costs  would  be  once  it  was  implemented  in  a  non-dataflow  language  cm  actual  hardware. 
Unfortunately,  in  ^  three  experiments  die  communication  costs  were  a  large  factor  in  the 
poor  performance  of  early  versions  of  each  program.  It  appears  diat  communicaticxi  costs 
are  something  that  sofrwtue  engineers  inexperienced  with  parallel  programming  have 
difficulty  predicting. 

One  approach  we  tried  for  investigating  communication  costs  was  to  use  FProto. 
PProto  is  not  intended  for  use  as  a  design  tool,  but  since  it  should  he^  with  analyzing  die 
feasibility  of  requirements,  it  t^peared  that  it  could  help  to  predict  communication 
botdenecks.  This  analysis  actu^y  tock  place  after  having  implenmted  the  image  warping 
algorithm  a  number  of  ways  for  the  tran^uter.  However,  our  premise  was  that  if  PProto  had 
been  available  earlier  it  would  have  saved  us  from  iix^lementing  some  of  the  image  warping 
versions  by  predicting  the  communication  bcxtlenecks,  and  promoting  us  to  investigate  other 
approaches.  In  prototyping  the  frrst  versiem  of  the  image  warping  algorithm,  which  suffered 
from  a  severe  I/O  bottleneck,  PProto  did  predict  some  of  the  communication  costs.  We  were 
also  able  to  analyze  what  would  have  happened  if  more  processors  were  available. 

While  PProto  was  generally  useful,  there  were  two  places  where  PProto  could  have 
provided  more  functionality.  In  order  to  prototype  the  ima^  warping  algorithm  we  first  had 
to  model  the  transputer  board.  This  was  quite  time  consuming,  and  PProto  would  be  greatly 
enhanced  if  it  had  more  architectures  built  in  than  it  does  now.  This  would  probably  help 
during  requirements  analysis  too;  particularly  if  a  specific  architecture  is  a  requirement. 
Additionally,  the  level  of  hardware  characterization  currently  sui^x)rted  is  not  detailed 
enough  to  base  a  hardware  decision  on.  Possibly,  if  PProto  had  tetter  hardware 
characterization  facilities,  it  could  be  used  for  deciding  which  machine  to  buy. 

The  other  place  where  we  ran  into  problems  with  PProto  was  when  we  wanted  to 
create  a  variation  on  an  existing  design.  This  was  a  problem  at  two  levels.  I^t,  no  way  to 
create  a  copy  of  a  whole  existing  prototype  was  identified  in  the  PProto  documentation.  As 
a  result,  the  only  way  we  could  create  a  variation  was  to  either  change  die  existing  prototype, 
or  create  a  new  one,  and  duplicate  all  the  work  of  creating  the  original.  Secondly,  creating 
the  new  variant  was  harder  because  die  facilities  for  copying  existing  components  is  limited. 
For  example,  when  prototyping  the  transputer  architecture  most  of  the  nodes  are  almost 
identical.  It  would  have  been  easier  to  create  difierent  variants  of  the  transputer  with 
different  numbers  of  processors  if  it  were  possible  to  duplicate  the  component  representing 
the  node. 
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Overall,  however,  PPioto  was  successful  at  helping  us  identify  the  communication 
bottlenecks,  and  in  analyzing  performance  with  different  numbers  of  processors.  It  would 
also  have  saved  time  in  the  in^lementadtxi  phase,  since  it  would  have  been  easier  to  try  the 
different  communication  strategies,  which  were  actually  implemented,  in  Pl^to,  and  dien  to 
in:q)lement  the  best  of  them. 

We  had  originally  felt  that  selecting  a  parallel  programming  language  would  be  an 
important  part  of  the  design  process.  With  die  current  state  of  parallel  programming 
languages  this  turned  out  not  to  be  the  case.  In  general,  die  choice  oi  actual  in^lementation 
language  would  be  limited  to  one  of  a  few  ctxiventional  languages,  probably  C  or 
FORTRAN  with  parallel  programming  extoisions  (such  as  Linda),  or  possibly  Ada.  The 
languages  designed  for  parallel  programming,  such  as  (Crystal  or  Id,  were  not  available  at  all 
<m  actual  parallel  machines,  or  only  (xi  a  limited  number.  While  due  to  logistics,  this  was  not 
a  oxicem  ftn*  this  project,  in  the  future,  the  chcace  of  in^lementation  language  may  well  be 
an  important  design  decision.  However,  there  are  no  tools  currendy  that  support  making  this 
choice. 

We  had  (»iginally  identified  a  number  of  other  tools  as  supporting  design  including, 
among  others,  POKER,  CODE,  and  Faust  Upon  closer  investigation,  these  tools  appear  not 
to  support  the  design  phase  so  much  as  the  implementation  phase.  They  are  ctmcemed 
mainly  with  making  it  easier  to  compose  parallel  programs.  Both  CODE  and  POKER 
require  drat  some  amount  of  real  code  exist  in  order  to  perform  as  intended.  It  is  possible 
that  these  tools  could  be  used  for  some  aspects  of  design,  such  as  identifying  shared 
variables  in  CODE,  but  code  would  have  to  exist  for  the  rest  of  the  program. 

One  last  point  should  be  made  when  discussing  the  mols  that  sui^xnt  the  design 
phase.  None  of  the  tools  produce  anything  that  could  be  used  as  die  basis  of  a  design 
review.  One  drawback  of  this  is  diat  while  die  final  design  may  be  captured  as  a  prototype 
or  executable  model,  the  decisions  leading  to  diat  design  are  not  captured  anywhm. 

4,5  Implementation 

During  iixqilementation,  target-^iecific  programs  are  developed,  based  on  the  detailed 
design  as  input  Most  module,  algorithm,  and  data  structure  decisions  should  have  been 
made  in  Detailed  Design,  so  what  remains  to  be  done  in  this  phase  is  the  transformatitMi  of 
these  smictutes  to  a  programming  langu^  appropriate  to  the  target  architecture. 

Depending  on  the  fai^ties  of  that  programming  language,  this  transformation  may  involve 
mcne  or  less  effort,  and  considerations  such  as  portability,  scalability,  etc.,  may  be  more  or 
less  supported. 

Most  of  the  tools  surveyed  fall  into  this  category.  Unfortunately,  a  lot  of  them  are  also 
specific  to  a  particular  language  and  target  machine.  It  is  much  easier  to  design  and  build  a 
tool  or  suite  of  tools  with  a  particular  langua^  aixi  target  machine  in  mind.  However,  die 
intent  of  the  PEEP  is  to  provide  a  platform  where  a  programmer  can  experiment  with 
multiple  languages  and  multiple  target  architectures,  so  we  concentrated  on  tools  tiiat 
purported  to  support  multiple  languages  or  to  be  architecture  independent  Debuggers  are 
alnx)st  always  very  target  dependent  so  no  debuggers  were  evaluated. 
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The  tools  used  for  in^lementadon  are  CODE  and  CSTools,  although  with  limited 
success  in  the  case  of  CODE.  Odier  tools  discussed,  but  which  were  not  used  for  the 
experiments  are  Crystal,  PAWS,  PCN  and  POKER. 

CODE  (verskm  1.2)  supports  three  languages,  Ada,  C  and  FORTRAN,  and  a  number 
of  target  machines.  We  tried  to  inclement  the  first  e)q)etiment  in  C  with  CODE.  This 
should  have  been  a  good  match  b^use  CODE  supports  a  graphical  dataflow 
representation,  and  the  quicksort  specificatitm  was  in  Id,  a  dataflow  language.  One  could 
even  imagine  being  able  to  automate  the  tianslatim  between  Id  and  CODE  to  some  extent 
Unfortunately  we  ran  into  a  few  problems  with  die  CODE  in^lemoitatioa.  First,  die 
support  in  CODE  for  replicating  code  sequences  did  not  work,  so  we  could  not  implement 
the  quicksort  as  specified.  We  continued  experinx^nting  w'  jh  CODE,  but  using  the 
exanqiles  that  were  provided  as  tutorials  instead  <  or  own  sample  programs.  In  this  way, 
we  were  able  to  generate  actual  code  that  could  have  been  cotxqiiled  on  a  parallel  machine.  In 
order  to  produce  code  for  a  particular  machine  using  CODE,  a  TOAD  (Translators  of  a 
Declaration)  for  that  machiiw  is  needed.  We  wne  not  able  to  actually  con^e  and  run  any 
of  the  generated  programs  because  the  set  of  TOADs  provided  with  (^DE  did  not  include 
any  of  the  paralld  machines  to  which  we  had  access  (iPSC/2,  warp  sysmlic  array,  CM-2, 
Multimax  or  Alliant).  We  also  tried  generating  an  Ada  version  of  the  tutorial.  The  code 
produced  looked  correct,  but  relied  on  rendezvous  very  heavily  which  could  result  in  poor 
performance,  depending  on  the  particular  Adz.  compiler  used.  On  the  other  hand,  this  was 
the  most  portable  approach  and  should  work  with  any  Ada  compiler.  On  die  positive  side, 
we  were  able  to  get  support  from  CODE’S  developers,  and  in  fact  these  problems  should  be 
fixed  in  CX)DE  2.0  which  is  about  to  be  released.  CODE  2.0  will  support  the  replication  of 
nodes  and  subgnqihs,  and  has  improved  its  Ada  generatitn  c^abilities.  It  should  also  be 
noted  that  die  type  of  mutual  exclusion  being  implemented  wiA  Ada  rendezvous  could  be 
enormously  sinqilified  if  inqilemented  using  the  Protected  Type  introduced  in  Ada  9X. 
Obviously,  diis  will  not  be  available  in  die  noct  version  of  CODE,  but  is  likely  in  future 
versiems  of  CODE  after  Ada  9X  is  approved. 

We  also  tried  to  use  POKER  as  a  coding  tool.  POKER  supports  simulation  of  a 
variety  of  non-shared  memory  MIMD  machines.  The  user  enters  die  parameters  for  the 
machine  indicating  die  number  of  processors  and  the  connections  between  them.  We  did 
not  actually  use  POKER  because  it  couki  not  be  instaUed  on  the  SPARCStation.  Based  on 
reviewing  the  documentation  POKER  appears  to  have  some  capabilities  similar  to  CODE. 
Like  CODE,  it  can  produce  output  that  can  be  compiled  on  a  lii^ted  number  of  target 
machines.  It  also  supports  a  tracing  facility  for  debugging,  and  a  profiler  for  simulating 
execution  times.  In  terms  of  languages,  POKER  only  supports  it’s  own  dialect  of  C,  and  a 
simple  sequential  language  called  XX. 

PCN  was  received  too  late  in  the  project  to  attempt  to  implement  any  of  the 
experiments.  However,  it  was  instaUed  successfuUy.  It  includes  a  number  of  interesting 
features  that  can  be  described.  PCN  is  a  language  and  a  set  of  tools.  One  interesting  aspect 
of  PCN  is  diat  it  {vovides  interfaces  to  FORTRAN  and  C  that  aUow  existing  code  to  be 
reused.  It  also  provides  a  debugger  (PDB),  an  execution  profiler  (Gauge),  and  a  trace 
analyzer  (Upshot).  From  reviewing  the  documents  it  appears  that  PCN  would  support  initial 
debugging  and  analysis  of  the  program  on  the  SPARCStation  prior  to  nwving  the  program 


to  a  parallel  target,  or  a  network  ai  workstations.  Of  particular  interest  is  the  fact  that  PCN 
supports  a  program  running  on  a  heterogeneous  netwoik  of  machines. 

Crystal  was  also  investigated  as  an  in^lementation  language.  Our  e?q>erience  with  this 
language  indicates  that  it  has  a  very  steep  leanung  curve.  Since  only  a  SPARCStation  based 
interpreter  is  available  (a  compiler  targeted  to  the  iPSC/2  is  being  devel<:^)ed),  we 
concentrated  oa  other  tools  and  languages. 

While  PAWS  is  not  a  tool  that  produces  an  executable  that  will  run  on  a  particular 
parallel  target,  it  does  provide  the  ability  to  compare  how  the  same  code  would  run  on  three 
different  machines  representing  three  very  different  classes  of  architectures:  the  CM-2,  the 
Multimax  and  the  iPSC/2.  We  did  try  to  run  an  Ada  implementation  of  the  second 
experiment  through  PAWS  to  see  how  it  would  run  on  the  different  architectures.  Given  the 
nature  of  that  experiment,  we  would  have  expected  the  program  to  map  tixxe  naturally  to  the 
Multimax  than  to  the  Connection  Machine.  Unfortunately,  PAWS  was  not  able  to  process 
all  the  Ada  constructs  used  in  tiie  program,  so  we  were  not  able  tt>  get  coaq)lete  results.  In 
general,  it  seemed  as  though  the  ability  to  compare  different  architectures  would  be  more 
useful  earlier  in  the  tfevelc^ment  process,  such  as  during  design.  One  place  where  PAWS 
would  be  useful  is  in  determining  an  appropriate  parallel  target  for  an  existing  sequential 
program  diat  is  about  to  be  parallelized. 

Faust  was  another  development  environment  for  parallel  programs  like  CODE  and 
POKER.  It  supported  C  and  FORTRAN,  and  a  number  of  parallel  targets.  Faust  initially 
appeared  very  promising  as  it  included  an  integrated  editor,  a  parallelizing  compiler  and  a 
program  monitoring  tool  for  performance  analysis.  If  it  had  one  drawback,  it  was  that  while 
it  had  advanced  support  for  vectcniring  optimization,  it  had  little  capability  to  deal  with 
programs  that  are  parallel  at  a  higher  level  In  the  end  no  experimentation  was  done  with 
Faust  because  the  project  had  been  dropped,  and  the  unsuppcnted  system  that  we  received 
was  incomplete,  and  in^sable  to  ruiL 

Hnally,  the  most  successful  of  our  coding  tools  was  CSTools  from  Meiko.  It  should 
be  noted,  iWever,  that  this  is  a  commercial  tool  and  most  of  the  preceding  ones  are  not;  it 
would  not  be  fair  to  expect  the  University  supplied  tools  to  meet  Ae  same  standards  of 
quality  as  a  commercial  product  CSTools  suppmis  a  number  ct  languages,  of  which  we 
had  C  and  FORTRAN.  It  is  limited  to  a  single  target  the  Meiko  transputer  board.  We  were 
able  to  inclement  versions  of  all  tiiree  cf  the  experiments  using  CSTools.  We  did  run  into  a 
few  minor  problems.  The  Quicksott  algorithm  is  normally  expressed  recursively,  and  the 
first  implementation  followed  this  design.  As  a  resull  die  transputer  processors 
implementing  die  final  sequential  quicksort  overflowed  their  stacks  n^en  the  recursion 
levels  got  too  deep.  This  was  not  handled  very  well  by  the  tools,  and  it  took  some  time  to 
determine  die  actual  cause  of  the  failure.  In  general  the  error  repotting  on  the  transputers 
could  have  been  more  helpful 

We  were  also  able  to  run  a  C  version  of  the  quicksort  on  an  iPSC/2  at  Cornell.  This 
was  based  on  the  code  develt^ied  using  CSTools  for  the  transputer.  To  retarget  the  program 
to  die  iPSC/2  the  calls  to  transputer  specific  library  routines  n^ed  to  be  replaced  with 
iPSC/2  qiecific  library  routines.  Fortunately,  the  C  libraries  for  the  two  targets  are  similar. 
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It  was  also  the  case  that  most  of  the  coding  effort  went  into  developing  the  first 
version  of  each  experiment  Subsequent  versions  of  the  program  were  developed  in  much 
less  time.  Usually,  variations  on  the  original  program  involved  singly  making  a  few 
changes,  and  then  rebuilding  and  running  the  program  to  measure  the  new  results. 

4.6  Testing  and  Int^ration 

There  were  no  tools  that  supported  testing.  Testing  for  all  three  experiments  consisted 
of  hand-writing  tests  and  simply  running  them.  Also,  the  nature  of  the  experiments  did  not 
allow  any  real  evaluation  of  tte  tools'  integration  capabilities.  In  general  we  would  assume 
drat  the  engineer  would  want  a  tool  that  made  it  easy  to  bring  together  independently 
developed  and  tested  pieces  of  a  system.  Each  of  the  experiments  was  designed  to  be  stand- 
al(»ie  so  there  was  no  integration  woik  to  be  done  after  it  was  tested. 

No  tools  for  performance  analysis  were  received  in  time  to  use  them  for  analytdng  the 
experiments.  However,  there  are  two  tools  in  this  area  that  are  still  worth  looking  at:  PCN 
and  IPS-2.  Of  these,  PCN  is  probably  the  most  promising  as  it  does  run  cm  the  Sun4 
wcxkstation.  It  includes  both  an  execution  profiler,  and  a  trace  analyzer.  IPS-2  was  not 
available  for  the  SPARCStation,  but  it  would  ap^cax  to  be  a  worthwhile  addition  to  the 
system.  The  key  feature  of  this  performance  measurement  tool  is  the  ability  to  display 
measurements  at  different  levels  of  abstraction.  If  running  a  program  on  a  large  number  of 
parallel  processors,  the  user  could  easily  be  overwhelmed  with  the  amount  of  performance 
data  if  there  isn’t  some  way  to  filter  it  if  it  were  ever  ported  from  the  VAX  to  a  Sun  host  it 
would  be  worth  investigating  on  the  PEEP. 

4.7  Supporting  Technologies 

This  secticxi  introduces  a  number  of  supporting  technologies  which  are  useful  during 
more  that  one  phase  of  the  development  process.  Some  of  the  tools  mentioned  above 
support  some  aspect  of  these  technologies.  We  also  mention  a  few  additional  tools  here  that 
support  a  ^)ecific  technology  without  being  specific  to  cme  phase  of  the  development 
process. 

4.7.1  Prototyping 

Prototyping  can  be  used  through  out  the  development  process.  During  the 
requirements  phase  a  prototype  user  interface  could  be  mocked  up  to  see  if  it  meets 
requirements;  during  design,  prototypes  help  the  designer  make  trade-offs  between  difierent 
approaches.  In  particular,  I^’roto  lets  the  systems  engineer  experiment  with  different 
process  decomposition  and  data  partitioning  strategies  to  see  whether  the  requirements  can 
be  met,  and  if  a  particular  strategy  is  better  than  the  others.  In  some  ways,  the  Id 
specifications  of  the  problems  could  also  be  looked  at  as  prototypes  of  the  actual  program. 

4.72  Modeling,  Analysis  and  Simulation 

Most  of  the  tools  support  simulation  is  some  form.  PProto  simulates  running  the 
program  as  specified.  The  Id  interpreter  simulates  rtinning  the  program  on  an  idealized 
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dataflow  machine.  POKER  supports  interfaces  with  a  number  of  machine  emulators.  The 
most  in^Kxtant  aspect  of  being  able  to  simulate  a  program  is  the  ability  to  make 
compaiiscxis.  A  particular  simulation  may  not  be  able  to  predict  the  pofcrmance  of  the 
program  (m  a  ^)eciflc  machine,  but  it  should  be  able  m  predict  the  relative  performance  of 
two  simulations  run  using  the  same  simulator.  It  is  also  important  to  have  a  simulator  (or 
simulators)  that  simulate  all  die  aspects  of  parallel  programs.  For  example,  the  simulation 
c^abilities  of  Id  alone  are  not  useful  for  predicting  communication  costs  on  a  message 
passing  architecture.  It  is  useful  to  have  both  types  of  simulators,  that  is,  both  actual 
performance  predictors  like  PAWS,  and  relative  performance  predictors  like  PProto. 

4.7 J  Visualization 

CSN_Illusttate  supports  a  limited  aspect  of  Hardware  and  Software  Integration  well, 
and  shows  how  simple  tools  can  improve  development  in  this  area.  (^N.Dlustrate  simply 
shows  the  layout  of  the  transputer  network,  and  the  mapping  of  processes  to  nodes.  This  is 
useful  for  determining  the  topology  of  the  network,  and  ^termining  if  a  particulaiiy  long 
communication  path  exists  b^een  two  nodes  that  communicate  extensively.  Ihtermetrics 
implemented  CSN.Illustrate  because  the  report  generated  by  the  CSTools  program  builder 
contains  so  much  infonnatitMi  it  was  difficult  to  analyze.  CSN.Illustrate  presents  the  san^ 
information  graphically.  However,  this  is  also  d^  drawback  of  (ZSN_Illustrate,  in  that  it  is 
dependent  on  the  CSTools  report 

TANGO  is  a  tool  that  supports  algotidun  animation.  For  this  project  it  was  ported  to 
the  transputer.  TANGO  had  bMn  used  to  show  grs^hically  how  a  particular  algorithm,  like 
a  binary  search,  worked.  Since  all  TANGO  does  is  display  events  as  it  receives  notification 
finom  the  application,  it  appeared  that  it  would  work  for  pt^el  algorithms  too. 
Unfortunately,  the  nature  of  the  program  tends  to  sequentialize  the  algorithm  when  it  is 
animated.  For  exanqrle,  two  nodes  are  performing  a  process  which  in  the  animation  is 
represented  as  moving  a  square  to  the  right  In  the  d^lay  first  one  square,  then  the  other 
will  move,  rather  than  moving  both  at  once.  As  a  result  TANGO  is  not  very  useful  for 
visualizing  parallel  algorithms. 

4.7.4  Configuration  Managtmont 

One  aspect  of  software  engineering  that  is  missing  fiom  all  of  the  tools  is 
configuration  management  In  some  cases  this  is  not  a  problem  because  configuration 
management  can  be  provided  outside  the  tool.  ¥or  example,  a  C  source  program  fcx*  the 
transputer  board  can  be  kept  under  RCS  or  SCCS  (UNIX  revision  control  systems).  Some 
tools  would  be  much  more  hdpfiil  if  there  was  an  integrated  configuration  management 
systeia  In  particular,  it  would  be  useful  to  be  able  to  track  variations  of  a  program,  and  to  be 
able  to  correlate  a  specific  change  to  performance  improvements  in  a  sper^c  area.  It  is  also 
possible  that  while  recording  die  textual  changes  between  two  files,  an  external 
configuration  management  system  would  miss  a  higher  level  change,  such  as  recording  what 
standard  transformation  caused  the  textual  changes. 
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4.8  Discussion 

No  scrftware  methodology  can  be  evaluated  without  usable  tools  to  make  experiments 
in  semi-automatic  processing  possible.  The  quality  of  tools  might  to  a  first  approximation 
be  expected  to  depmd  on  the  amount  of  research  effort  expended.  The  highest  usability  was 
found  in  parallel  languages,  which  are  traditionally  a  prolific  research  area.  Use  of  multiple 
languages  is  likely  to  pi^uce  meaningful  results  by  testing  experimental  hypotheses  that 
programming  an  algmithm  in  the  right  language  or  class  of  languages  would  enable  the 
algorithm  to  meet  some  requirements  or  system  objectives  more  effectively  than  others.  In 
practice  this  has  not  been  a  factor.  Goiuinely  new  parallel  programming  languages  are 
being  devel(^)ed,  but  are  only  available  as  interpreters  or  on  one  particular  qiecialized 
hardware.  A  particular  langui^e  like  Id  or  Cry^  can  be  used  to  prototype  the  problem,  but 
in  (»der  to  ptcxhice  a  system  that  will  run  (xi  a  useful  target  hardware,  a  dialect  of  (xie  of  the 
standard  high  level  languages  (Ada,  C,  FORTRAN)  is  most  likely  to  be  used. 

Another  active  research  area  that  has  similar  promise  is  compiler  technology  adapted 
to  allow  die  detection  and  use  of  intrinsic  paralleli^  in  sequential  languages.  The  kind  of 
experimental  hypothesis  to  be  tested  on  these  kinds  of  tools  would  be  that  some  class  of 
target  environments  could  be  successfully  and  effectively  utilized  by  programs  in  some  set 
of  source  languages  compiled  with  each  tool  Unfortunately,  no  such  environments  have 
been  successfully  potted  to  the  PEEP  so  far. 

CASE  tools  and  performance  analysis  tools  are  slightly  less  active  research  areas. 
Some  tools  have  been  found  partially  usefiil,  but  incomplete  due  to  less  research  activity. 
CODE  from  the  University  of  Texas  at  Austin  supports  a  few  "ordinary  hi{^  level" 
languages  (Ada,  C,  FORTRAN)  and  compiles  to  several  architectures  via  a  q)ecial  translator 
(called  a  Translator  Of  A  Declaration,  or  TOAD)  for  each  language/target  pair.  It  provides 
an  important  capability,  target  architecture  indq>endence,  that  is  essential  for  any  parallel 
softwsue  develcpment  capability.  CODE  also  supports  a  well  engineered  grafdiicjd  user 
interface  that  allows  the  sequential  segments  of  a  problem  to  be  stated  in  a  way  independent 
of  communicaticm  and  synchronization  mechaitisms.  It  thus  supports  parallel  programming 
with  MIMD  flexibility  and  SIMD  simplicity.  CODE  has  a  number  of  limitations, 
particularly  in  die  ability  to  contrcd  synchronization  code  effectively,  but  diere  is  a  new 
version  soon  to  be  available  that  promises  to  improve  this  situation.  It  is  an  example  of  a 
promising  tool  with  enou^  limitations  cunendy  to  be  usable  only  for  demtmstration 
problems. 

4.8.1  Tool  Evaluation  Summaries 
CODE 

Only  partial  results  could  be  obtained  with  CODE.  Small  programs  similar  to  the 
supplied  CODE  demos  could  generate  complete  programs,  but  none  were  executed  due  te 
not  having  the  right  target  hardware.  Sort  and  search  code  could  be  produced  (but  without 
automatic  replication,  due  to  a  bug  already  mentioned),  but  again  could  not  be  executed.  The 
new  version  of  CODE,  CODE  2.0,  should  fix  most  of  the  problems.  Another  favorable 
aspect  of  CODE  was  that  we  were  able  to  get  some  support  from  the  developers,  although 
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they  were  not  able  to  send  us  any  additional  TOADs.  We  did  not  evaluate  ROPE,  the 
reusable  component  library  associated  with  CODE.  If  it  works  as  documented  it  does 
appear  to  be  a  useful  facility  for  finding  reusable  components  by  searching  for  keywords  in 
the  description  of  the  component 

eSNJUustnUe 

CSN.niustrate  is  a  very  sii]:q)le  tool  for  graphically  displaying  the  actual  configuration 
of  die  Transputer  fw  a  particular  application.  It  is  dependent  on  the  format  of  die  Meiko 
tools  configuration  report  and  had  to  be  modified  when  the  report  changed  in  a  new  release. 

CSTools 

CSTools  are  a  commercial  parallel  programming  toolset  for  the  transputer.  It  includes 
a  compiler,  program  builder,  and  debugger.  These  tools  were  quite  effective  for  developing 
programs  for  the  transputer.  Other  commercial  environments,  such  as  EXPRESS,  were  not 
available  for  conqiarison. 

Id 

Id  capabilities  inpoved  over  the  course  cd  our  evaluation.  Initially,  it  was  difficult  to 
interpret  the  parallelism  graphs  because  separate  activities  (such  as  insertion  and  lookup  in 
searching)  could  only  be  distinguished  by  reprogramming.  The  addition  of  a  capability  for 
iissociating  coIchs  with  Id  functions  in  the  first  Monsoon  interpreter  release  in  mid-1992 
solved  this  dilemma.  Now  the  interpretation  of  sources  of  parallel  activity  in  parallelism 
profiles  is  done  easily. 

An  issue  with  the  use  of  Id  is  its  user  interface,  which  can  be  using  the  shell,  but  for 
debugging  has  to  be  under  EMACS  using  Lisp  commands.  Both  debugging  and 
visualization  were  a  problem  because  any  execution  failures  produced  a  Lisp  interrupt  For 
anyone  familiar  with  functional  languages  who  has  time  to  leam  a  few  Lisp  library  routines 
and  is  willing  to  read  interpreter  code,  tius  is  not  too  much  of  a  problem.  The  Id 
development  team  at  MIT  was  quite  suipxti\te  in  helping  us  install  the  system  and  get  over 
initial  learning  hurdles.  The  documentation  that  accoirpaiues  the  language  also  improved 
with  the  new  release. 

*Lisp 

Thinking  Machines  has  a  *Lisp  simulator  fieely  available.  We  had  h<pd  that  the 
simulator  could  be  used  to  develop  a  *Lisp  program  on  the  SPARCStation  that  would  then 
run  with  little  or  no  modification  on  a  G>nnection  Machine.  CXir  experience  with  the 
simulator  was  that  along  with  being  undocumented,  it  only  implements  a  subset  of  the 
language  and  provides  only  a  subset  of  the  standard  libraries.  As  such  it  is  not  possible  to 
use  it  to  develop  programs  of  a  realistic  size. 
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PProto 

Widt  PProto,  modeling  of  the  image  warping  algorithm  verified  the  communication 
issues  observed  in  the  code-level  Transputer  experiments.  The  graphic  display  of 
communication  paths  showed  the  bottlenecks  diat  led  to  serialization  of  the  graphic 
communicadons  in  the  series  of  experiments.  There  were  some  difficulties  with  modeling 
the  Transputer  architecture  due  to  die  lack  of  enough  database  and  library  facilities  to 
develop  the  model  incrementally.  The  model  had  to  be  read  in  fiom  source  every  time,  and 
diere  was  no  way  to  distinguish  additions  related  to  promtyping  different  variants.  This  is  a 
serious  problem  for  a  tool  that  is  supposed  to  support  rapid  prototyping,  since  it  is  easy  to 
lose  information  and  even  confuse  the  meaning  of  results  widiout  clear  doivation 
tn<»r.hanisfns-  This  is  partially  related  to  a  technique  such  as  Meta-Crystal  for  expressing 
specific  transformations,  but  shows  that  more  general  transformational  derivation 
approaches  are  important  for  really  doing  rapid  prototyping  effectively.  This  problem  is  not 
specific  to  parallelism,  and  projects  supporting  dus  Idixl  of  refinement  such  as  E-L  at 
Htuvard  are  very  inqxxtant  for  parallel  software  development  It  is  possible  that  the 
difficulties  were  related  to  tiie  fact  that  the  tool  was  only  available  for  two  weeks,  and  we 
were  not  familiar  with  its  capabilities. 

Issues  of  human  interface  and  capacity  were  enou^  of  a  problem  tiiat  tiie  utility  of  the 
tool  might  be  compromised  for  mcne  Aan  demonstration  problems.  With  PProto,  the 
slowness  of  rebuilding  and  executing  successive  experiments  was  aggravated  by  poor 
memory  usage,  and  no  way  to  clean  up  between  partial  builds  means  that  the  interpretation 
of  successive  experiments  is  more  dif&ult  Whether  this  would  affect  the  validity  of  results 
might  require  some  e)q>ettise  in  the  structure  of  the  PProto  interpreter,  at  least  in  the  worst 
case  or  using  large  models.  Another  drawback  to  Pl^oto  is  the  fact  that  it  relies  the  ONTOS 
object-oriented  database  system  to  run.  This  does  not  come  with  n’roto  and  is  not  generally 
already  installed  on  a  user’s  system. 

Crystal 

This  language  was  not  fully  evaluated.  We  have  a  working  Crystal  interpreter  on  the 
SPARCStation,  but  the  compiler  that  is  being  developed  was  not  available  during  this 
project 

PAWS 

For  PAWS,  only  the  san^le  programs  supplied  with  the  to(^  successfully  exercised 
the  tool.  However,  tiiis  was  to  be  expected  as  only  the  alpha  version  of  the  tool  was  available 
at  the  time.  The  user  documentation  for  PAWS  could  alw  use  in^novonent  At  least  one 
envirexunent  variable  necessary  to  run  PAWS  is  not  documented.  A  beta  version  of  the  tool 
is  now  available,  and  work  continues  to  add  functionality  and  make  it  more  robust 
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PCN 


The  PCN  sys^n  did  not  become  available  until  the  end  of  the  project,  so  no 
experiments  have  been  tried.  However,  its  documentation  is  quite  ccxnplete  and  indicates  that 
it  is  available  for  a  number  of  platftxms.  It  is  definitely  worth  further  investigation. 

POKER 

POKER  was  not  fully  evaluated  as  it  did  not  run  (» the  Sun4  woikstaticMi.  It  also 
appears  that  no  further  woric  is  being  done  on  it,  so  it  is  not  wrath  further  investigation. 

TANGO 

Attempts  to  use  TANGO  calls  in  Transputer  programs  resulted  in  output  that  was  not 
capable  of  b^g  used  to  display  parallel  activities  on  die  tar^t  hardware.  TANGO  was 
deigned  for  animation  of  sequential  programs,  and  would  need  additional  support  fra 
identifying  parallel  activities  at  the  lowest  level  of  data  collection. 

O^ter  Tools 

Interesting  FC^TRAN  tools,  sudi  as  PAT  and  FORCE,  were  tracked  down  and 
detertniiied  to  be  available,  but  not  pursued  for  installation.  Free  distributions  of  these  are 
available,  and  instructions  fra*  getting  diem  are  induded  widi  the  rcEP. 

A  demonstration  vetsioa  of  BALSA  H,  another  animation  tool,  was  also  tried.  BALSA 
n  is  even  more  restrictive  than  TANGO  in  that  it  only  runs  and  animates  programs  running 
on  the  Macintosh. 

The  experiences  with  Faust  demonstrate  a  problem  of  interest  fra  PEEP  planning 
purposes.  The  problem  is  acquiring  and  installing  tools,  where  die  issues  are  first  getting 
access  and  then  verifying  conpletoiess  of  the  ddivery.  Acquiring  Faust  itself  was  delayed 
for  licensing  reasons  and  was  finally  acquired  almost  two  years  after  the  start  of  the  PEEP 
effort  When  Faust  did  arrive,  there  were  no  installation  instructions  (or  documentation  of 
any  kind)  and  support  only  for  reading  the  tape.  PEEP  installers  tried  to  understand  how  to 
get  it  running  by  experimenting  widi  it  and  looking  at  source  code.  However,  the  source 
code  did  not  seem  to  be  complete  so  the  effort  was  abandoned. 

hiquiries  were  made  rai  availability  of  odier  potentially  inmiesting  tools  at  the 
beginning  of  the  prototype  PEEP  project  These  included  SISAL,  STRAND-88,  DINO, 
RAPIDE  and  Piraeus,  which  are  parallel  research  languages  that  may  be  useful  in  the 
future.  Odier  tools  with  good  concepts  such  as  Hypertasldng,  IPS-2,  TOPSYS  and  Pisces 
are  not  available  on  Suns,  thou^  all  might  be  usefd  if  diey  were  eventually  prated. 

TOPSYS  is  a  parallel  environment  wit’'  special  support  fra  monitoring  that  would  be  likely 
to  be  immediately  useful  if  it  could  if'-v-  "grated  succes^iilly. 

An  inqxxtant  aspect  of  tool  ''.iV.iy  for  tools  having  some  partially  successful  results 
is  the  quality  of  description,  documentation  and  support  Good  support  is  essoitial  to  any 
useful  software  tool  and  includes  many  kinds  of  doctunents:  installation  manuals,  user 
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guides,  tutorials  and  reference  manuals,  as  well  as  consulting  for  problems  encountered  in 
initial  use  and  longer  term  exercise.  This  is  not  a  level  of  support  normally  regarded  as 
feasible  for  university  projects,  but  has  been  attempted  by  a  number  of  the  tool  projects  we 
dealt  widi.  The  good  news  is  that  many  tried  to  include  some  effort  in  this  direction.  The 
bad  news  is  that  the  results  were  mostly  unsatisfactory. 

Many  tools  have  been  discontinued.  This  includes  some  of  the  pioneering  efforts  in 
parallel  environments,  such  as  Prometheus  and  PIE,  which  were  always  thought  to  be 
important  primarily  for  dieir  influence  on  later  work.  Simulators,  for  example  PARET  from 
Bell  Labs,  are  very  important  in  rapid  prototyping  and  studying  relative  costs  of  processing 
and  communication  elements,  but  tend  to  be  used  for  the  projea  that  develt^s  th^  and  not 
to  get  enough  siq>port  to  last  b^ond  that  project  Molecule  was  an  important  project  to  be 
discussed  in  more  detail  in  die  methodology  section.  It  was  a  classic  research  project  in  that, 
it  successfully  demonstrated  a  useful  meth^ology  and  was  compk  ted  with  no  perceived 
need  for  any  continued  existence. 

4.9  Methodology 

If  we  look  at  the  problem  solving  process  for  each  of  saddle  problems,  die  e?q)erience 
reveals  as  much  about  the  effect  of  traditional  strftware  engineering  techniques  as  any 
dteoretical  analysis.  The  in^xxtant  practical  problems  were  proUem  partitioning, 
understanding  variants  of  sanq)le  algorithms,  and  combining  compatible  tools  to  use  die 
capabilities  of  each  in  describing  a  sensible  development  path. 


4S.I  Problem  Partitioning 

Both  die  sorting  and  searching  experiments  had  die  same  problems  meaningfully 
partitioning  the  problem.  For  sorting,  one  issue  was  how  to  relate  processors  to  parts  of  the 
algorithm,  as  solved  by  die  binary  tree  arrangement  oi  the  processors.  For  both  prc^lems, 
the  question  of  how  to  partition  the  ininit  effectively  was  solved  only  after  some  initial  work 
was  done  to  understand  the  communication  costs.  This  is  typically  a  goal  ol  simulatim 
studies.  Even  had  a  simulator  been  available  the  modeling  could  e^y  have  takoi  as  much 
time  as  the  onnmunications  coding;  still  once  code  was  available,  additional  eiqierimoits 
could  be  performed  more  quickly  by  local  modifications  of  the  existing  programs.  It  is 
worth  noting  that  the  second  implementations  succeeded  quickly  pardy  due  to  this  kind  of 
reusabiliQr.  The  mediodology  this  reuse  seons  to  relate  best  to  is  that  CODE,  involving  a 
database  of  sequential  program  segments  that  can  be  connected  under  programmer  control 
by  data  flow  requirements  and  synchronization  constraints. 


4.92  Combining  Id^  Crystal  and  CODE 

All  three  problems  had  some  of  the  same  methodological  difficulties.  While  useful  in 
verifying  ideas  about  intrinsic  parallelism,  the  Id  programs  provided  no  insight  into  the  "real 
problems"  of  organizing  the  coti^tation  to  control  communication  and  synchronization 
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costs.  This  is  where  a  comlnnatioa  of  Id,  Crystal  and  CODE  will  be  die  most  help.  Once  an 
algorithm  is  demonstrated  in  Id,  it  can  be  transliterated  to  Crystal  and  decomposed  using 
rfnmflin  mailings  into  "agent"  subpioblems.  This  mapped  program  can  then  be  input  to 
CODE  to  separate  sequential  activities  fixim  synchronization  mediods.  Then  TOAD’s  will 
translate  the  code  to  the  appropriate  target  architecture. 

Actually  generating  the  code  for  any  of  the  applications  by  diis  process  is  still  mily 
possible  in  principle.  While  the  Id  programs  can  usually  be  deo^strated  to  work 
essentially  as  expected,  the  process  of  functionally  compiling  them  by  Meta-Crystal  like 
transformations  is  gmng  to  have  to  be  done  by  hr^  for  the  foreseeable  future.  Meta- 
Crystal,  still  under  develr^rment,  is  not  available  for  even  initial  evaluatkxi,  and  caiuxk 
dierefote  be  expected  to  be  of  production  quali^,  as  the  experience  with  odier  tools 
denxmstrates.  Transliterating  Ae  iterative  code  to  concrete  languages  such  as  Ada,  C  and 
FORTRAN  is  not  eq)ecially  difBcult  Finally,  CODE  itself  is  still  in  a  process  of  evolution 
aivj  may  be  approaching  die  level  of  code  development  capabilities  necessary  for  continued 
use.  CODE  2.0  will  be  used  for  project  work  in  higher  level  classes  at  the  University  of 
Texas  diis  fall,  and  should  be  available  shortly  afterwards.  Also,  TOAD’s  are  not 
cornpletely  trivial  levels  of  translation,  but  can  generally  be  developed  by  graduate  level 
people  in  a  few  mondis  for  a  particular  target  and  run-time  system. 

This  approach  is  basically  the  same  as  the  Mdecule  paradigm.  Molecule  was  a 
language  project  ^^riiere  the  key  feature  of  the  language  was  die  ability  ti>  specify  the 
program  at  different  levels  of  abstraction.  In  the  highest  level  of  abstraction,  the  program  in 
expressed  in  taiget-independoit  dataflow  semantics.  In  the  next  level  of  abstraction  the 
program  is  expressed  in  terms  of  a  set  of  prunitive  parallel  operations  that  are  appropriate 
for  a  particular  type  of  architecture.  Finally,  the  program  is  translated  to  a  specific  source 
language  for  wltich  there  is  a  compiler  on  the  target  hardware,  such  as  C  for  die  iPSC/2. 


4.9  J  Surveyed  Projects 

The  original  expectations  about  these  projects  were  actually  quite  limited.  Typically 
university  researchers  are  very  happy  about  le^ts  that  demonsoate  the  validity  of  an 
approach  or  the  utility  of  a  tool  concept  and  organizatimi  and  do  not  regard  it  as  within  their 
province  to  worry  abwt  the  viabiliQr  particular  implementations.  A  large  number  of 

projects  included  in  the  original  survey  of  available  tools  and  techniques  were  exactly  of  this 
nature. 

A  typical  one  was  the  Mr^ecule  project,  which  demonstrated  the  feasibility  of 
beginning  with  dataflow  level  notations  using  fairiy  well  understood  transformational 
and  con^nlation  techniques  to  produce  executable  code.  This  was  the  paradigm  that  seemed 
most  like  existing  software  tool  and  support  methodology  in  the  sequential  world.  What  a 
casual  reading  of  the  literature  of  work  on  traditiotud  languages  and  develr^rment 
environnoents  suggested  was  that  the  Molecule  paradigm  was  a  practical  possibility,  using 
mostly  available  software  technology.  Evolutionary  language  arid  compiler  approaches  that 
supported  aspects  of  this  rxx)del  were  the  FORTRAN  PICSES  and  FORCE  methodologies. 
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Other  examples  were  the  evolutionary  variants  of  traditional  sequoitial  languages  and 
run-time  support  techniques  firom  Ada  and  Concurrent  Pascal  (and  various  Concurrent  C’s) 
to  FORTRAN  90  and  Schedule.  From  the  environment  world,  R^^  and  Faust  initially 
seemed  to  demonstrate  that  enou^  parallelism  could  be  detected  and  incrementally 
improved  uptxi  by  comlnnatitMis  of  programmer  supplied,  static,  and  dynamic  analysis  to 
make  the  gradual  improvement  of  essentially  sequential  techniques  a  practical  alternative. 
Again,  it  turns  out  R*^  and  Faust  have  been  |»imaiily  prototyping  techniques  and  the 
engineering  of  real  development  environments  on  t^  basis  has  been  essentially 
discmtinued. 
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5  Conclusions 


In  this  section,  we  summarize  our  findings  and  conclusions.  These  are  organized  as 
PEEP  requirements,  reconomendations  for  the  evolution  of  the  PEEP,  and  a  vision  of  a  next 
generatitMi  PEEP  system. 

5.1  Requirements  for  the  Software  Engineering  of  Parallel  Systems 

This  section  is  (Organized  into  life  cycle  requirements  and  requirements  on  siq)porting 
technologies,  following  the  structure  of  the  revised  tool^xoblem  solution  matrix  (section  2). 

S.1.1.  Life  Cycle  Reqidremenis 

Our  survey  and  experimentation  have  led  us  to  identify  die  following  requiranents  for 
any  system  supporting  tte  software  oigineering  oi  parallel  systems.  Capabilities  needed  for 
parallel  software  engineering,  based  on  our  Survey  and  Experimentation  include  diese: 

•  architecture-independent  program  notation  for  problem  specification  (e.g.,  using  a 
dataflow  langur^  such  as  Id), 

•  ability  to  interpret  programs  to  determine  processing  and  communication  costs,  as 
well  as  to  verify  correctness  cf  alpxithms  and  adequacy  of  synchronization 
methods  (e.g.,  using  a  simulator  such  as  PProto), 

•  architecture-dependent  evaluation  methods  to  determine  bdiavior  on  different 
architecture  classes;  this  can  inclu^  languages  supptming  various  classes  of 
processing  and  communication  operations,  as  well  as  support  for  synchronization 
at  various  granularities  (e.g.,  using  ipptopriate  stdtware  d^lcpment  techniques  to 
develop  and  execute  various  algorithiais  on  qipiopriaie  hardware,  as  partially 
sipported  by  CODE/ROPE,  or  using  architecture  simulation  as  supported 
PAWS), 

•  ability  to  relate  programs  in  different  notatitms  to  allow  all  kinds  of  architecture  and 
target  machine  dependencies  to  be  tried  for  programs  implementing  the  same 
algorithm  (e.g.,  as  done  by  Meta-Crystal  operating  on  Cfystal,  or  as  done  by 
human  beings  using  nxxe  conventional  ttx^), 

•  detenotination  of  algorithm  speedup  as  a  function  ci  number  of  processors  (e.g., 
doing  evaluation  on  mult^le  targets  directly), 

•  ability  to  evaluate  perfonnance  by  measuring  execution  without  stopping  or 
interfering  (e.g.,  as  atterrpted  wiA  TANGO,  and  as  supported  by  event-based 
debugging  techniques  such  as  MDP  and  Dalek). 


These  capabilities  directly  address  the  various  phases  of  software  development  as 
follows: 

•  Requirements  Analysis  using  iterative  specification  and  simulation  techniques  such 

as  and  rapid  prototyping  for  requirements  verification, 

•  Architectural  Design  using  refinemait  and  transformatitMi  metiiods,  as  well  as 
trxxleling  and  interpretation  techniques  diat  conq>are  multiple  execution  such  as 
PAWS  and  PProto, 
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•  Low-Level  Design  and  Coding  using  multiple  languages  and  capabilities  such  as 
CODE/ROPE;  and  target  suppon  software  for  organizing,  compiling,  building  and 
testing  arehitecture-  and  target-  dependent  code  and  modules, 

•  Visualization  and  Performance  Aiialysis  using  tracing  and  monitoring  techniques 
such  as  tiiose  supported  in  modem  parallel  debuggers,  and  in  PCN’s  monitoring 
ftcilities. 

The  remaining  areas  of  the  Revised  Matrix  are  covered  by  tools  that  are  not  really 
unique  to  parallel  systems  and  hence  were  not  dealt  witii  in  connection  witii  the  PEEP.  A 
few  areas,  such  as  completely  non-interfering  monitoting  and  automated  system  selection 
involving  tradeoffs  of  hardware,  software,  sui^xxt  and  run-time  libraries,  are  problems 
tiiat  are  currently  beyond  die  state  the  art  and  cannot  really  be  dealt  with  under  the  PEEP 
philosophy  of  using  existing  tods  and  capabilities.  As  research  progresses  in  monitcaing 
and  abstract  trodeliiig  techniques  (e.g.,  PAWS-style  modeling  at  multiple  levels  of 
interpretation),  support  for  these  curabilities  should  he  incorpoiated  into  the  PEEP. 
Similarly,  once  mote  languages  are  generally  available  on  pa^el  machines,  language 
selecticai  will  merit  futtiier  automated  support 

For  the  purpose  of  prioritization,  we  analyze  capabilities  into  three  categcaies: 
essential,  recommended,  and  mce-to-have.  An  essemial  capability  must  be  present — one 
can’t  develop  a  program  witfiout  it  Language  translators  (ctniqnlers  and  linkers)  fall  into 
titis  category.  Recommended  capabilities  provide  a  maxked  improvement  to  the  parallel 
software  developmoit  process,  but  are  not  essential  to  development  A  nice-to-have 
capability  may  te  useful  for  a  specific  problem,  but  is  ndtber  essential  nor  generally 
recommended  for  parsdlel  softv^  develqrment 

As  the  experiments  on  the  Transputer  attest  none  of  the  tools  we  investigated  are 
really  essential  The  support  for  parallel  progranaming  provided  by  die 
compilerAinker/debugger  toolset  of  the  Meiko  CSTools,  and  similar  vendor  support 
softWe  for  parallel  conqmters  meets  basic  requirements.  There  are  many  limitations,  such 
as  (mly  a  few  availade  languages,  insufficient  buffering  ctqxdnlities  for  effective 
commuiucations,  or  uneven  stq>port  for  various  synchronization  mechanisms.  However,  the 
lack  of  other  capabilities  can  be  conq)ensated  for  by  mote  programming  effort  Without 
additional  programming  tools,  parallel  programs  will  be  developed  through  trial  and  error  as 
was  dime  in  the  initial  experiments. 


•  Recommendation:  Other  commercial  tool  sets,  such  as  EXPRESS,  should  be 
systematically  evaluated. 

Due  to  budget  considerations,  other  commercial  tool  sets  and  languages  were  not 
evaluated  as  part  of  this  study.  Providing  alternatives  to  CSTools  should  be  a  goal  of  the 
PEEP,  and  any  alternatives  need  to  be  investigated  prior  to  incluskm  on  the  PEEP. 

Thus,  most  parallel  programming  tools  as  investigated  by  the  PEEP  are  rated  as 
recommended,  though  in  varying  degrees.  Parallel  languages  and  con^ilers  capable 
optimizing  parallel  loops  are  recommended  because  they  support  software  quality  factors 
such  as  portability,  reliability  and  maintainability  without  sacrificing  efficiency  (assuming 
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tfiat  ttie  same  language  is  available  on  multiple  targets — not  currently  the  state-of-the-art  In 
the  future,  Ada  9X  may  help  in  this  regard,  by  providing  a  language  standard  that 
incorporates  siqtport  for  concurrency  and  distributioit) 

Run-time  library  ct^nbilities,  such  as  micro-tasking  and  broadcast  facilities,  are  also  at 
this  level  of  ioqtortance.  Once  a  few  langut^es  supporting  ccmcurrency  and  basic 
communication  medtanisms  become  availaUe,  more  languages  become  just  nice  to  have. 
Similarly,  once  an  adequate  development  method  is  siqtported,  more  tools  provide  a  kind  of 
oxivenient  redundancy  diat  does  not  si^iificantly  inqnove  the  power  of  the  platfomt 
Parallel  debugging  and  the  current  level  of  visualization  support  as  provided  by  some 
vendors  ate  tools  of  this  land — any  one  tool  will  suffice.  Almost  all  die  odier  tools 
ctmsidered  for  incluskMi  in  the  PEEP  are  also  of  this  kind.  This  includes  additional 
languages  and  all  die  various  tools  for  modeling  and  identifying  parallelism  that  were  not 
fully  evaluated. 

We  conclude  that  the  capabilities  of  Pl^oto  and  PAWS  are  recommended  since  diey 
are  unique  in  addressing  modeling  issues  related  to  rapid  prototyping  and  architectural 
conqiaristHL  hi  addition,  some  sort  of  high-level  data  flow  ^lecification  language  would  also 
seem  to  be  needed.  Of  the  languages  we  tried.  Id  was  quite  successful  in  diis  regard, 
although  another  dataflow  language  with  similar  support  would  probably  serve  as  well 

Thus,  Id,  CODE,  PAWS  and  PProto  are  recommended  techniques.  A  tool  supporting 
transformation  techniques  similar  to  Meta-Crystal  is  also  recommended,  as  is  sufficient 
monitoring  capability  to  do  basic  performance  evaluation.  More  languages  and  tools  to 
assist  in  such  activities  as  visualiz^on,  design  data  capture  and  transformational 
configuration  control  including  derivatitxi  histories  are  nice~to~have  (thou^  there  are  senne 
researchers  who  would  claim  such  facilities  have  always  been  essential).  Ordinary  nice-to- 
have  tools  are  the  myriad  small  tools  that  can  help  with  documentation,  verification, 
description,  and  measurement  of  software.  One  exan^le  is  the  CSN_Illustrate  tool 
developed  for  use  witii  tiie  Meiko  tools  to  depict  code  usage  on  the  Transputer.  The 
following  table  summarizes  the  conclusicm  for  each  tool: 


Tool/Language 

Conclusion 

Ada9X 

Recommended 

CODE/ROPE 

Recommended 

Crystal 

to  Have 

CSN_Illustrate 

Nice  to  Have 

CSTools 

Essential 

(or  equivalent) 
Faust 

Not  Recommended 

Id 

Recommended 

*Lisp 

Nice  to  Have 

PAWS 

Recommended 

PCN 

Recommended 

POKER 

Not  Recommended 

PProto 

Recommended 

TANGO 

Nice  to  Have 

-40- 


•  Recommendation:  Work  on  PProto  and  PAWS  should  continue,  and  production 
quality  versions  of  these  tools  should  be  integrated  into  the  PEEP 

Hiese  two  tools  fit  well  widi  the  methodology  we  have  proposed,  and  there  are  very 
few  efforts  in  the  research  community  diat  offer  comparable  capabilities.  We  find  these  two 
tools  to  be  (^relatively  high  quality  in  leliaMlity,  usa^ty  and  documentation,  as  (xxnpared 
to  the  spectrum  of  University  t(x>ls  we  evaluat^  They  are  not  robust,  production-quality 
products,  however.  We  believe  that  bring^g  diese  tools  to  a  statt  where  they  can  be  reliably 
used  for  large-scale  programs  widi  a  minimum  of  support  from  the  developers  should  have 
as  much  funding  priority  as  further  research  on  the  underiying  technology. 

•  Recommendation:  Nov  developments  in  CODE,  Id,  Ada  9X  and  PCN  should  be 
followed  for  new  developments  and  integrated  into  the  PEEP  as  necessary. 

The  field  of  parallel  computing  is  expanding  rapidly.  It  will  be  important  to  integral 
new  tools  into  the  PEEP  fcrevaluatkxi  as  diey  are  developed  Of  die  to^  evaluated.  Id  and 
CODE  are  two  projects  where  the  work  continues,  and  is  developing  interesting  technology. 
Of  die  languages  Ada  9X  and  PCN  should  be  followed  With  Ada  9X,  the  Ada  language  is 
improving  its  support  of  shared  memory  multi|XOcessors,  and  adding  capabilities  to  provide 
support  for  districted  or  message  passing  architectures.  PCN  has  s(nne  of  the  mote 
advanced  support  for  program  monitoring,  and  gpod  support  for  integrating  existing  ccxk. 

5JJ  Integration  Requirements 

For  the  prototype,  we  have  limited  the  amount  of  “wrsqiping”  required  to  inctnpcxate 
a  new  tool  into  the  PEEP.  Tool  integration  has  been  aided  by  Ae  open  systems  framework 
provided  by  UNIX’s  process  model  and  X’s  client-server  iixxleL  Cture  versicms  of  the 
PEEP  shoidd  maintain  this  opoi  systems’  strategy,  recognizing  that  the  fiameworks  will 
evolve  beyond  what  is  available  today. 

In  the  area  of  ctxnmon  services,  the  object-tniented  programming  noodel  has  identified 
promising  processing  techniciues  in  such  tradtional  activities  as  databases  and  language 
processing.  Ccxicurrent  devdopmoit  consisting  of  related  processes  cooperating  through 
standardized  interfaces  is  inheirat  to  object  oriented  technCics  as  well  as  to  many  aspects 
(tf  CASE  techwdogy.  Recent  trends  in  scrftware  development  envirraments  toward  what  are 
generally  termed  software  process  models  indicate  a  possible  direction  fnr  future  PEEP 
growth.  Taken  together,  all  dtese  kinds  of  research  can  be  thought  of  as  suggesting  ways  to 
improve  conqnling  technology,  to  facilitate  common  tool  interaction,  or  to  make  use  of 
object  orient^  process  control  notaticms.  Which  technique  ot  point  of  view  would  be  best 
for  the  PEEP  is  the  l(xig  term  issue;  achieving  effective  tool  interacticm  in  some  way  for  the 
short  term  is  also  a  serious  prd)lem. 

Within  this  evolving  framewc»k,  it  should  be  possible  to  increase  tiie  degree  of  tool 
integration  by  defining  uniform  data  and  event  formats. 


Another  aspect  of  integration  is  Configuration  Management.  This  is  crucial  both  for 
maintaining  product  consistency  within  large  development  effwts,  and  for  maintaining  a 
program  and  its  variants  within  exploratory  programming .  In  using  PProto,  we  (4>seTved 
that  proto^pes  would  be  easier  to  build  if  there  was  a  mature  cut-and-paste  facility.  As  we 
discovered,  ooc  often  wants  to  rqjlicate  a  given  node  or  graph  fragment  to  use  els^here. 
However,  this  is  hampered  by  the  need  to  specify  input  and  output  ports  on  a  node  (at  all 
times).  It  would  be  helpful  if  diete  was  a  way  to  specify  absolute  and  relative  ports  (as  in  the 
cells  of  a  spreadsheet).  Absolute  ports  would  always  refer  to  the  same  place,  even  whoi  a 
node  is  duplicated.  Reladve  ports  could  refer,  for  exatr^le,  to  an  adjacoit  node. 

A  Configuration  Management  system  should  allow  users  to  ctmtrol  multiple  versions 
of  a  numbo:  of  diverse  kinds  of  entities  and  tire  relationships  among  tiiem.  Entities  may  be 
at  any  granularity  witiun  the  system  (as  tire  above  example  suggests). 

Throughout  the  SEPA  program  we  experienced  a  need  for  better  automated 
configuration  management  support  Much  ^  our  wcnk  on  the  sample  problems  was 
ai^noached  making  sli^t  variants  on  an  irtitial  program,  re>building,  re-testing,  and  re¬ 

measuring  p^onnance.  This  cycle  was  repeated  many  times  for  each  problem.  Thae  were 
no  suitable  facilities  for  isolating  the  differences  between  successive  versicms  rtf  the 
programs,  and  conelating  the  changes  with  die  nsodified  test  cases  or  new  performance 
results.  Meaningful  relationships  between  tiie  changes  and  the  pexfotmance  impact  of  fhose 
changes  had  to  be  established  and  tracked  by  hand. 

•  Recommendation:  A  comprehensive  corfigwradon  management  facility  be  added 
to  the  PEEP  and  integrate  with  all  the  tools  that  support  our  recommended 
methodology. 


The  “Artifacts”  system  developed  for  DARPA’s  ProtoTech  program  would  make  an 
excellent  starting  point  This  system  maintains  a  network  oi  fine-grained  arbitrary 
relationships  between  source  programs,  test  cases,  results,  and  any  odier  data  in  die  system, 
and  allows  the  automatic  triggering  of  artntr^  tools  upon  changes  to  any  of  the  Hata 

5,2  SEPA  Long-Term  Plan 

This  section  contains  specific  recommendations  for  PEEP  evolution  and 
recommendations  for  further  work  by  Rome  Labmatory  in  this  area. 

SJ.I  Tool  Enhancements 

To  use  existing  tools,  the  following  imjxovemmts  ate  needed: 

*  The  ability  to  map  Id  programs  semi-automatically  to  decoirqrosed  subproUems. 

A  tool  to  do  this  would  consist  of  lists  of  data  and  control  structure  transformations 
togedier  with  pre  and  post  conditions  needed  to  apply  them.  A  method  of  supplying 
additional  information  about  program  data  and  operations,  as  well  as  intended 
architecture,  would  be  one  of  the  main  interfaces  for  this  tool  The  effea  would  be 


to  allow  the  user  to  program  a  formalism  similar  to  Crystal  to  allow  Meta-Crystal 
transfoimatioas  {^tpropriate  to  the  possible  architecture  to  be  ^lied  The  output 
would  be  a  Crystal-like  program  widi  enou^  architectural  spe^city  to  be 
translated  to  a  concrete  language  such  as  Ada,  C,  *LISP  or  FORTRAN  without  any 
actual  target  dq)endcacies,  but  with  oiough  architectural  dependence  to  be 
cooqnled  withtMt  losing  parallelism. 

•  The  ability  to  keq)  derivation  informatitm  in  ROPE.  Infrxmation  on  the  user- 
supplied  characteristics  and  transformations  used  to  produce  the  concrete  code 
would  be  kept  with  the  program  in  a  ROPE-like  libr^  systeoL 

•  More  flexible  synchronization  conditions  in  CODE  to  allow  better  synchronization 
using  target  compiler  capabilities.  CODE  2.0  is  supposed  to  support  guard 
expressicxis  which  wiU  do  this. 

•  More  TOAD’S  for  mote  targets,  particularly  to  support  C  for  popular  MIMD 
machines.  There  are  supposed  to  be  ways  to  get  these  with  minimal  (two  montii) 
effort  These  could  be  d^eloped  as  needed 

•  The  ability  ci  target  linkers  and  run-time  monitor  facilities  to  support  intelligent 
monitoring  and  tracing.  These  already  appear  to  have  been  prototyped  in  the 
TOPS  YS  project  successfully;  Meiko  and  other  hardware  vendors  appear  to  intend 
to  suppmt  full  monitoring  and  tracing  eventually.  It  is  also  possible  this  could  be 
added  to  PCN. 

Tools  to  supply  more  life-cycle  coverage  would  include: 

•  Mote  languages,  such  as  C*  and  Linda.  Ultimatdy  the  goal  could  be  to  allow  any 
parallel  langua^  to  be  used  on  the  I^EP.  These  could  be  procured  individually  as 
needed 

•  Mote  static  and  dynamic  analysis  ux>ls,  particularly  in  tiie  support  of  Testing  and 
V&V.  These  could  be  taken  finm  existing  DoD  supported  efforts  and  tailored  to 
tile  PEEP,  or  procured  individually  as  needed 

•  Mote  advanced  design  tools  such  as  Meta-Crystal  and  those  supported  by  the 
DARPA  Commcm  Prototyping  Language  effort,  iiKluding  Proteus  in  the  area  of 
languages  and  E-L  in  the  area  of  transfmmatkHial  design  support 

The  issue  of  appropriate  and  useful  documentation  for  user  friendly  access  to  tiie 
results  of  applied  research  projects  needs  to  be  kqit  very  prominent  If  possible,  there 
should  be  fr^w  up  <ni  tiie  Faust  experience  to  determine  if  there  is  any  way  to  avoid  the 
loss  of  the  results  ^  such  an  interesting  project  fri  particular,  the  Sigma  editor  and  the 
IMPACT  mraitoring  design  using  automatic  probe  insertion  would  be  important 
capabilities  in  die  long  term  reEP  if  tiiey  could  be  recovered 

SJJ  PEEP  Evolution 

•  Recommendation:  The  parallel  software  development  methodology  evolved  during 
the  course  <^the  SEP  A  contract  should  be  supported  by  a  well-integrated  set  of 
robust  tools,  and  used  experimentalty  by  organizations  outside  Intermetrics  to 
validate  our  assessment  of  its  value 


To  do  diis,  the  tools  must  first  be  made  more  robust,  and  reliable,  and  should  offer  a 
consistent  user  inter&ce.  They  should  then  be  integrated  into  the  proposed  methodology. 
This  could  be  achieved  through  the  development  of  a  composite  Users  Guide  that  presents 
the  metlMxiology  and  the  tocds  in  an  integrated  manner. 

5  J  J  Sustain  PEEP  Technology  Base 

Recommendation:  A  mechanism  should  be  created  for  capturing  and  preserving 
the  software  that  results  from  University  research  in  parallel  processing.  When  a 
tool  is  procured  under  sponsored  research,  there  should  be  requirements  on  the 
delivery  cf  adequate  installation  instructions  to  get  all  components  running,  and 
on  the  existence  of  am  introductory  user  manual  with  exaunples.  These  systems 
should  be  archived  mth  all  such  materials. 

The  goal  would  be  to  preserve  sufficient  information  to  allow  both  meaningful 
evaluation  and  estimation  die  value  of  continued  support  or  extmsion.  One  clear  and  very 
disappointing  conclusion  of  our  work  is  that  much  promising  University  software  is  being 
lost  Parts  of  the  very  promising  Faust  system  were  not  included  on  die  source  distribution 
tape,  and  could  not  be  located  by  die  leseardiers  who  remain  at  the  university.  The  Molecule 
project  had  ended  and  the  software  was  '‘put  to  rest** 

Most  researchers  see  published  papers  as  die  primary  output  oi  dieir  work,  and  these 
papers  are  propedy  archived  and  readily  retrievaUe.  The  software  diey  develt^,  however,  is 
not  stored  or  t^talogued  in  any  ^tematic  way.  Often,  all  trace  ad  apotentially  valuable 
software  system  is  expunged  from  storage  upcm  graduation  or  transfer  of  die  individual 
researcher. 

Although  litde  of  this  software  is  of  {xoduction  quality,  it  is  ncxietbeless  a  very 
valuable  resource.  It  can  be  used  for  more  detailed  understanding  of  the  ideas,  for 
productizing  by  third  parties,  or  as  a  source  of  raisaUe  components. 

We  believe  that  it  would  be  a  major  benefit  to  the  parallel  processing  cranmunity  if  a 
mechanism  were  established  for  capturing  and  preserving  these  scrftware  assets.  Simply 
indexing  the  software  against  the  puUished  papers  describing  the  research  would  be  a 
sufficient  retrieval  mechaiusm.  What  is  neecM  is  a  centralized  cdlection  activity.  Provisions 
should  also  be  made  in  government  research  grants  to  support  diis  archiving. 

We  do  not  propose  to  change  the  intellectual  property  ri^ts  to  any  such  software. 

The  existing  legal  system  for  determining  rights  and  ownership  is  workaUe.  It  may  require 
negotiation  with  univeraties,  but  this  can  be  done.  However,  widiout  assurance  diat  the 
software  still  exists  in  a  usable  form,  this  negotiation  process  is  not  worthwhile. 

We  believe  that  Rome  Laboratories  could  fill  a  very  valuaUe  role  by  providing  such  a 
central  archiving  activity,  at  least  for  the  products  of  government-funded  Universi^  research 
in  parallel  processing. 


•  Recommendation:  Further  research  should  be  encouraged  in  specific  areas  where 
the  SEP  A  program  has  been  unable  to  find  adequate  existing  tools 

We  believe  that  TOADs  should  be  created  for  more  popular  parallel  architectures 
using  the  CODE  system.  Better  tools  for  the  visual  display  (animation)  of  the  dynamics  of 
parallel  programs  in  execution  should  be  de>«loped.  Tools  for  testing  parallel  programs 
should  be  created  that  will  handle  the  non-deterannacy  inherent  in  these  programs.  Tools 
for  designing  parallel  programs  (before  coding  and  independent  of  prototyping),  and 
analyzing  the  requiranents  (independent  of  dw  design)  are  generally  missing  fnxn  the  field 
and  deserve  more  attention. 

•  Recommendation:  Systematic  independent  evaluation  of  emerging  technology  for 
parallel  software  development  should  be  continued 

Reports  on  the  develc^ment  of  new  tools  and  techniques  are  being  published 
continuously.  We  could  have  easily  spent  all  our  time  surveying  die  literature  and  keeping 
up  with  the  new  developments.  In  terms  of  this  project  it  was  necessary  to  stop  collecting 
data  at  some  point,  and  start  hands-<xi  evaluations.  Based  on  die  hands-on  evaluation  we 
have  found  that  the  published  literature  on  new  parallel  software  tools  and  mediodcdogies  is 
not  a  very  suitable  basis  for  evaluating  dieir  usdfulness.  All  papers  report  success.  The 
usability  of  the  languages,  tools,  or  methodologies  by  diird  parties  cannot  readily  be 
established  by  reading  the  piqiers.  Fuidier  programs  like  SEPA,  in  which  an  independent 
third  party  installs  the  tool  and  uses  it  for  simple  problems,  will  help  to  calibrate  the 
applicability  of  the  new  ideas  that  have  emerged  since  our  surv^. 

•  Recommendation:  For  the  evaluation  of  the  appUccdtUity  of  parallel  software 
development  techniques  to  command  and  control,  a  command  and  control  bench¬ 
mark  should  be  developed. 

Most  benchmark  and  example  programs  used  for  evaluating  parallel  tods  and 
architectures  are  scientific  and  mathematical  applications.  As  such  diey  do  not  provide  a 
good  measure  of  such  tools  and  architectures  for  command  and  control.  A  sample 
command  and  control  problem  would  need  to  include  aspects  common  to  many 
2q)plications  such  as  resd-time  processing,  large  database  processing,  image  processing,  and 
simulation. 

In  order  to  provide  a  realistic  sart^le,  even  a  "small”  application  would  be  quite 
complex.  As  a  result,  a  successful  sanqrle  would  need  to  be  weU  documented  with  both 

requirements  and  design  as  well  as  a  user’s  guide.  A  full  set  of  input  data  would  be  required 
along  with  documented  expected  results.  A  retargetting  manual  would  also  be  needed  for 
rewriting  the  application  in  languages  odier  than  the  initial  implementatioiL 


SJ.4  Next  Generation  PEEP 

Using  die  base  technology  of  a  fully  populated,  highly  functional  PEEP,  it  is  possible 
to  take  die  next  step  in  supporting  hi^  performance  parallel  conqiuting: 

•  Recommendation:  that  the  PEEP  be  extended  to  support  "architectural 
supercompiling" 

The  supetcoo^iler  concept  has  been  developed  over  the  past  decade  and  was  named 
by  Wolfe  in  198Z  It  involves  the  application  uniform  deductive  methods  fimn 
verification  and  inference  research.  Compiling  with  a  supercompiler  is  the  selective 
application  of  transfoimatitms  to  a  program  representation  that  is  successively  refined  until 
it  can  execute  on  the  desired  hardware.  Supeicompiling  provides  a  way  of  looking  at  the 
cooperation  of  limited  transformation  took  on  a  platform  such  as  the  PEEP  as  a  kind  of 
non-deterministic  program  mapping  that  resembles  compiling. 

The  oqieiiments  we  have  conducted  have  led  us  to  believe  that  the  best  long-term 
qipoitunity  for  automated  support  of  parallel  software  development  is  implementation  of 
the  supercoir^iler  notiort  This  follows  firom  our  recommended  mediodology:  starting 
development  with  very  high-level  languages,  such  as  Id,  and  refining  programs  in  stages 
through  mote  ccmctete  languages,  adding  mformation  about  details  of  the  target  architecture 
in  the  later  stages. 

The  efifectiveness  ctf  the  proposed  metiiodology  and  die  selected  tools  would  be 
greatly  oihanced  if  it  were  possible  to  mix  lanf'  /ages  at  any  level  of  this  refinement  process. 
Unfortunately,  most  of  the  tools  that  ate  availabL  at  present  ate  tied  to  single  languages,  or 
small  sets  of  languages. 

One  of  the  noost  useful  concepts  in  the  field  is  tiie  use  of  human-guided,  machine- 
facilitated  transformations  to  reduce  the  higher-level,  architecture-independent 
representations  of  programs  to  concrete  form.  This  is  die  supercompiler  notim. 

At  present,  the  Meta-Crystal  system  being  develqred  at  Yale  comes  closest  to 
providing  the  desired  mechanistiL  However,  Meta-Crystal  is  tighdy  connected  to  the  Crystal 
language,  which  does  not  meet  die  needs  of  most  parallel  software  develqiers.  We  believe 
that  the  idea  of  human-guided  successive  transformations  is  too  ingiortant  to  be  linked  to  a 
single  language.  The  PEEP  should  provide  a  frameworlc  for  an  evolving  library  of 
transformations  that  can  be  applied  to  programs  written  in  any  language. 

The  superconopiler  concqit  for  parallel  applications  requires  a  suite  of  language 
translators  and  odier  tools  that  operate  on  a  common  internal  representation  of  programs. 
Developing  such  a  stdm  of  tools  will  be  a  major  undotaking.  It  will  require  the  cooperation 
of  government,  academia,  and  industry  to  create  a  sufficioidy  broad  set  ci  languages, 
analysis  and  simulation  tools,  and  transformations.  The  first  step  must  be  to  define  a 
comnoon  representation  that  can  be  accepted  by  all  parties. 

Industry  is  not  ready  to  accept  a  universal  low-level  representation  for  pRograms. 
SIMD  vendors  will  not  accept  a  standard  that  is,  or  seems  to  be,  strtxigly  oriented  to  MIMD 
architectures,  and  vice-versa.  MIMD  vendOTS  who  provide  a  shared  rocwory  model  will  not 
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accept  a  standard  that  is  stnxigly  oriented  to  message  passing.  Even  if  a  consensus  on  low- 
level  representations  could  be  forced,  it  might  be  harmful  to  the  further  advancement  of 
parallel  processing  technology  —  whether  MIMD  or  SIMD,  Shared  MemOTy  or  Message 
Pasring,  Vector,  Dataflow,  or  other  architecture  classes  are  best  is  still  controversial.  This 
should  be  resolved  by  q)en  competition  in  die  marketplace. 

However,  we  believe  diat  a  high-level,  architecture-neutral  r^nesentation  could  be 
defined  that  would  be  acceptable  to  all  parties.  In  our  expoiments  under  the  SEPA  program, 
we  have  found  that  a  hig|i-level  dataflow  representation  of  programs  (using  U)  is  readily 
mappable  to  all  classes  of  architectures.  Furthermore,  the  tools  that  would  be  needed  to  map 
programs  represented  at  this  level  to  low-level  representations  for  specific  architectures  are 
the  very  facilities  diat  cort^irise  a  superconqiila'  system. 

DARPA’s  ProtoTech  program  has  developed  technology  that  addresses  this  need — 
die  Kernel  CPL  multi-language  environment  firarneworit.  This  fiamework  will  need 
adaptation  to  meet  the  needs  of  die  full  range  of  parallel  processing  machine  but  its 
effectiveness  as  a  supercon^er  firamework  for  sequential  languages  has  already  been 
demcxistraifid,  particulariy  the  work  of  Yale  University.  Intermetrics,  and  Software 

Options  Inc. 

ProtoTech  has  also  developed  a  fnmal,  mathematically  miented  language  for 
prototyping  parallel  software,  Proteus.  Although  Proteus  is  tied  to  shared  memory 
architectures,  it  merits  fiirdier  investigation  as  a  member  oi  a  family  of  languages  to  be 
supported  by  the  PEEP.  There  are  many  parallel  programming  languages  and  protoQ^ng 
tools,  so  implementing  Proteus  on  the  PEEP  is  not  high  priori^,  but  the  framework 
provided  by  the  Kernel  CPL  techntdogy  is  unique  and  very  in^xxtant 

The  PEEP  should  be  the  framework  for  this  integration.  The  experiment  could  be 
performed  before  adding  the  supercompiler  technology  discussed  above,  or  could  be 
deferred  until  that  technology  is  present  in  the  PEEP. 

The  necessary  “productization”  of  the  required  tools  could  be  done  by  enlisting  the 
cooperation  of  tire  original  developers  or  tiieir  funding  agencies,  by  separately  contracting 
with  those  organizations,  or  by  having  a  PEEP  foUow-on  contractor  work  witii  their  source 
code.  We  expect  that  the  latter  method  would  yield  more  immediate  results,  but  that 
establishing  more  mechanisms  for  helping  university  researchers  to  produce  robust  tools 
would  be  of  more  long-term  value. 
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A.  Experiment  Descriptions 

A.1  Experiment  1:  Exploiting  Inherent  Parallelism 

Parallel  sorting  is  a  classic  problem.  Very  good  sequential  algorithms  exist,  such  as 
radix  exchange,  diat  have  essentially  no  inherent  parallelism.  For  applicatitms  with  very 
large  quantities  ci  data,  or  severe  timing  requirements,  parallel  approaches  are  indicated. 
Hardware  methods  with  die  necessary  stochastic  properties  have  been  extensively  studied, 
and  are  ooc  solution.  In  principle,  diese  could  be  programmed  in  software,  but  probably  not 
efSciently  except  on  dedicated  mesh  architectures. 

Assuming  that  hardware  is  infeasible,  traditional  metiKxls  of  sorting  may  be 

used  witii  sigitificant  parallel  speedup.  This  invtdves  choosing  an  algorithm  with  significant 
intrinsic  parallelism,  or  using  a  divide-and-conquer  approach  to  partitioning  the  data. 
Techniques  that  divide  tiie  input  into  as  many  chunks  as  there  are  processors,  sort  each 
chunk,  and  merge  die  sorted  results  are  good  candidates  for  diemetical  algoridim  speedup, 
but  are  effective  in  practice  only  if  commutucation  costs  can  be  kept  relatively  small.  In  a 
mixed  approach,  individual  processors  can  use  fast  sequential  algorithms  and  less  refined 
commurtication  techruques  diat  are  less  optunal  dian  programmable meshes  or  systolic 
arrays.  This  is  the  area  where  code  replication  techruques  are  useful  on  MIMD 
arcltitectures. 

We  chose  Quicksort  for  our  experiment  Parallel  sotting  has  been  studied  extensively, 
and  other  more  parallel  sorting  algoridinis  are  docunmited  in  the  literature.  However,  our 
intent  in  dus  erqreriment  was  to  try  to  parallelize  a  standard  sequential  algoridim  widi  the 
tools  available.  Certainly,  if  we  wanted  the  best  possiUe  results  in  terms  of  performance  a 
different  algorithm  would  have  been  choseit 

Our  parallel  son  routine  involved  a  sttaightfmward  attend  use  die  intrinsic 
parallelism  of  Qdckson  in  a  dataflow  language.  Its  formulation  in  Id  selects  an  arbitrary 
element  (which  can  be  the  first)  and  divides  the  input  into  two  parts,  consisting  of  elements 
that  are  less  and  greater  dian  dw  selected  one,  and  then  applies  the  same  function  to  the 
parts,  (^ckson’s  basic  formulation  in  Id  is  as  follows: 

(qson  lesseq-a)  -h*  a :  (qson  greater-a) 

where  “++”  is  a  list  concatenation  operation,  is  a  LISP-like  cons  operator,  “a”  is  the 
selected  element  and  “lesseq-a”  and  “greater-a”  are  the  parts  of  the  input  values  that  are 
recursively  sorted  by  Quicks^ 

The  strategies  used  for  sorting  are  described  in  section  4. 

A,2  Cooperative  Concurrency 

Searching  algorithms  are  also  well  studied.  Even  more  than  in  sorting,  fast  searching 
mediods  rely  on  techniques  for  orgaitizing  data  that  are  inherendy  sequential  One 
parallelization  approach  would  be  to  do  a  fast  parallel  sort,  and  do  all  searching  using  a 
sequential  binary  search  algorithm.  This  approach  has  serious  problems  with  the  insertion 
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of  new  oitries,  making  it  inappn^niate  in  real-time  applications.  Using  multi-tasking  to 
synchronize  insertion  and  deletion  by  dividing  input  into  chunks  and  merging  similar  data  is 
as  valid  a  technique  as  for  sorting.  However,  mer^g  is  somewhat  more  con^licated  and 
expensive  when  searching  since  the  point  where  the  insertion  takes  place  has  to  be  protecred 
until  the  merge  is  con^leted.  Merging  insert  (write)  functions  with  lotrinq)  (read)  ftmctions 
introduces  synchronization  issues.  The  intent  of  die  second  experiment  was  to  investigate 
die  synchronization  issues  associated  with  this  type  of  algorithm.  One  sirrqile  way  tt> 
achieve  this  is  to  associate  parallel  processes  with  parts  of  data  or  the  memory  they  access. 
This  is  a  powerful  way  to  use  multiple  processors  in  real-time  searching.  It  can  be 
implemented  like  a  multi-tasking  system  and  adjusted  dynamically  using  tqiprc^itiate 
scheduling,  but  is  also  very  adaptable  to  mid-  and  fine-grain  parallel  architectures. 

A  basic  idea  for  such  a  search  is  to  treat  n  processors  as  searching  through  a  multi¬ 
threaded  list,  where  parts  of  die  list  can  be  global  and  parts  more  local  (shared  by  some  or 
even  non-shared).  Each  segment  die  list  can  be  treated  as  entoed  by  a  hash  function  that 
selects  i^ch  buffer  and  processors  to  use.  For  local  memory,  the  processor  is  determined 
by  selecting  its  address  space.  For  shared  memory,  synchrmiization  needs  to  be  associated 
with  the  insert  actions.  Ptocessos  could  be  difToentiated  on  the  basis  of  whedier  they  have 
access  to  shared  memory,  and  would  have  slightly  different  lookup  and  add  actions.  This 
would  be  an  inherent  difference  between  MIMD  and  more  uniform  SIMD  architectures. 

In  its  smallest  form,  a  SlMD  algorithm  would  probably  be  rou^y  a  recursive  Id 
program  with  two  functions:  one  to  add  a  value  to  the  list  bared  on  some  key  that  is  a 
fiinctitxi  of  its  value,  and  tiw  other  to  lookup  a  value.  There  would  be  some  main  or 
environmental  process  adding  values,  and  erentually  multiple  clients  trying  to  access  the 
values  (in  order  to  verify  and/or  qierate  on  tiiem).  The  Id  functions  we  uied  were: 

type  struct-elem  =  (record 
key  =  int; 
value  =  *0}; 

def  lookup  key  struct 
(case  struct 
nils  nil; 

I  (struct-elemuest)  *  { 

if  (matching-k^  struct-elemJrey  key)  titen 
value  s  struct-eloiLvalue; 
else  if  (possible-key  struct-eletukey  key)  then 
value  »  Gotdnip  rest  key) 
else  value  »  nil; 
in 

value}}; 

def  add  new-elem  struct « 

(case  struct  of 
nil  a  new-elem; 

I  (struct-elem:rest) » { 

if  (possible-key  struct-elenLkey  new-eletiLkey)  then 
(struct-elem :  (add  rest  new-elem)) 
else  (struct-elem :  ( new-elem :  rest )); 
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}); 

Ifere,  “struct”  is  an  Id  I-structure,  built  by  “add”  —  “add”  is  written  as  if  it  returned  a 
structure  when  vriiat  it  really  needs  to  do  is  insert  into  or  append  to  the  global  I-structure 
from  a  ctxitrolling  process.  At  the  simplest  level,  “add”  will  insert  elements  whenever  it 
encounters  a  posidon  that  is  vacant  This  can  be  thou^t  of  as  encountering  an  end  of  tape 
maTfay  and  writing  the  next  element  The  keys  can  be  diought  of  as  relative  addresses  which 
match  when  they  are  equal  and  are  possible  matches  when  they  do  not  exceed  the  last  used 
address. 


A3  Target-Indepradent  SIMD 

The  image  warping  algorithm  uses  map  data  containing  a  height  and  color  for  a 
rectangular  area  and  determines  the  data  used  frv  creating  an  image  viewed  from  a  particular 
angle  that  looks  3-dimeiisional  to  the  viewer.  The  basic  “Floating  Horizon  Algorithm”  is  as 
foUows:  if  we  have  elevation  and  color  data  on,  for  sin^licity,  a  N  by  N  grid,  we  can  use  the 
floating  horizon  algorithm  to  display  that  data  given  a  viewing  angle  of  phi  We  may  think 
of  the  data  as  hd^t  function,  h,  tmqjped  into  3-D  space  by: 

Zaj)=h(X(i).Ya)) 

where  X(i)  =  i,  Y(j)  *  j,  ij  =  0, 1, N-1 

If  the  image  is  to  be  viewed  at  angle  phi  (the  top  view  is  given  by  phis90),  the 
projection  is  given  by: 

(X,Y,Z)  ->  (X,  Y  ♦  sin(phi)  +  Z  *  cos(phi) ) 


Given  a  cross-section  as  shown  in  the  diagram,  the  height  on  the  viewing  plane  of  a 
particular  ptnnt  (Y,  Z)  is  calculated  using  the  equation  above.  If  the  calculated  hdght  is 
greater  thsm  the  current  horizon,  then  select  tiie  point,  and  make  it  tire  new  horizon. 
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Otherwise,  *e  point  is  below  the  horizon  as  viewed  from  angle  phi,  and  it  is  ignored.  This  is 
done  for  all  Y’s  to  produce  one  column  of  the  projected  image.  The  same  calculation  is 
drnie  for  all  columns  (i.e.  X*s) 

The  pseudo-code  for  this  algorithm  may  be  expressed  as: 

procedure  DisplayHeightField  (Z,  C,  N,  phi)  is 

arguments: 

ZQG  indexed  square  array  of  heights 

con  indexed  square  array  of  color 
N  number  of  dements  in  a  row/column  of  Z 
phi  view  angle 

variables: 

ImQQ  indexed  square  array  that  represents  the  rendering 
ijjc  integers 

p,pO  projected  hdghts,  screen  y-coordinates 
horiz  integer 

function  P(x,y)  is 
begin 

return  int[(y*sin(phi))  +  (z*cos(phi))]; 
endP; 

begin 

for  i  in  0..N-1  loop 

— reset  horizon  for  new  column 

horiz  :=  0; 

for  j  in  0..N-1  loop 

— calculate  i^jectitxi  for  point 
pO:=Pa,Z[i][j]); 

p:=p0; 

—  fril  in  image  if  above  horizon 

—  and  fill  in  intermediate  pixds 
while  p  >  horiz  loop 

Im[i][p]  =  C[ilO] 

p  =  p.l 

end  loop; 

—  save  new  horizon,  if  necessary 
if  pO  >  horiz  then 

horiz  :s  pO 
end  if; 
end  loop; 
end  loop; 

end; 


Note  that  we  simplified  the  data  partitioning  by  restricting  the  directicHis  fix>m  which 
the  image  could  be  viewed,  and  displaying  an  orthogonal  view,  rather  than  a  perspective 

view.  These  simplifications  do  not  clumge  the  basic  algorithm,  just  die  amount  of  data 
required  by  each  processor. 
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In  implementing  the  image  warping  algoridun,  we  went  through  four  iterations.  The 
first  implementadtm,  shown  in  the  figure  below,  resulted  in  poor  performance.  As  individual 
pixels  were  sent  fiom  die  tiasnputers  to  the  display,  there  was  a  severe  communications 
botdeneck  betweoi  the  two  machines. 


In  the  nnt  inqilementaticMi,  most  of  the  {plication  remained  the  same,  but  instead  of 
sending  individual  pixels,  it  sends  X  image  structures  to  reduce  the  number  of  calls  die  X- 

server.  This  had  better  perfonnance,  but  there  was  still  contention  for  the  communication 
line  to  the  X-server.  As  a  result  the  transputer  network  would  intermittendy  hang,  and  the 
data  on  the  screen  was  sometimes  corrupted.  Since  not  much  chan^  between  the  two 
inqilementaticxis  it  todc  very  litde  time  to  iixqilement  the  second  version  based  on  the  first 


The  third  implementation,  shown  below,  added  daisy  chaining  between  the  transputer 
nodes  to  eliminate  the  corrupted  data.  This  was  generally  successful,  however,  the  transputer 
network  would  still  hang  on  occasion  as  the  single-thre^ed  X-server  on  the  display  could 
not  handle  die  calls  coming  from  multiple  transputer  nodes.  Again,  since  the  code  was 
changed  very  litde,  it  took  very  litde  time  to  implement  the  third  version. 
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The  final  strategy  was  to  eliminate  the  X  calls  from  the  tnrsputer  nodes  entirely.  This 
involved  learning  the  XDR  utilities  for  exdianging  data  between  the  tran^uters  and  the 
host  As  a  result  it  totrfc  somewhat  longer  to  implement  this  version  of  the  algorithm  than 
die  previous  two  variations.  However,  it  was  wordi  die  effort  as  this  produced  the  most 
reliable  implementation,  and  it  was  almost  as  fast  as  implementation  three. 


The  following  table  shows  the  relative  performance  of  the  different  implementaticxis. 
The  measurements  given  are  seconds;  the  first  number  is  the  elapsed  amount  of  processing 
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93 

15 
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9 
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B.  Tool  Summaries  **Data  Sheets” 

This  section  contains  the  descriptions  of  all  the  tools  and  environments  surveyed.  It 
includes  bodi  tools  that  were  included  in  the  original  survey  report  and  other  tools  that  were 
discovered  later.  It  is  organized  into  parallel  environments,  FORTRAN  parallelism  supptHt 
tools,  special  purpose  tools,  and  parallel  langua^s.  All  tools  are  described  in  the  present 
tense  in  terms  of  their  capabilities  even  though,  as  discussed  earlier  in  the  reporu  a  lot  of 
diese  projects  have  ende^  and  the  tools  are  no  longer  available. 

Parallel  Environments 

CODE 

Develqied  at  the  University  of  Texas,  die  Computati(»-Otiented  Di^lay  Environment 
(CODE)  provides  high  level  parallelism  support  for  a  parameterized  range  of  target 
architectures.  It  includes  a  graphical  editor  monitoring  facility  for  program  graphs  of 

schedulable  units  in  (currently)  three  program  languages,  FORTRAN,  Ada  and  C.  The 
CODE  effort  is  closely  integrated  with  another  environment  project,  die  Reusal^ty- 
Oriented  Programming  Enviroiunent  (ROPE)  at  Austin.  This  shares  the  program  database 
with  CODE  and  allows  intermixing  of  program  con^ionents  developed  under  either  system. 
The  program  graphs  are  hierarchical  compositions  of  other  units  in  either  dw  grtqdi 
language  or  the  base  language.  These  four  kinds  d  units  are:  contrr^ling  units,  filters  (used 
for  ccnsiraints),  dependency  lists  (involving  structures  of  related  units),  and  subgraphs. 

Target  architecture  models  ate  described  in  a  TOAD  (Translator  Of  A  Declaradon) 
which  includes  synchronization  informadon  mi  target  configurations.  TOADs  exist  for 
Ada,  FORTRAN,  and  various  extensions  of  C,  for  target  environments  running  on  a  variety 
of  parallel  machines,  including  a  Sequent  Balance,  a  VAX  cluster,  a  Cedar  (clusters),  an  Intel 
Hypercube,  and  a  Cray  X-MP.  Some  executable  FORTRAN  modules  that  have  been 
published  use  the  Schedule  execution  environment  for  synchronization  operations, 
indicating  that  this  approach  is  amenable  to  suppcxting  common  target  oivironmaits  in  a 
natural  way. 

The  develc^nnent  path  using  CX30E  is  to  edit  a  graph  of  program  components  called 
Schedulable  Units  of  Confutation  (SUQ.  This  is  done  under  ccMitrol  d  an  X-based  menu 
system  witii  fifteen  icons  for  drawing  graphs  of  SUC  relationships.  One  icon  supports 
general  attribute  editing  actions  and  thus  provides  a  fully  non-linguistic  model  of  ^ta  flow. 
Code  in  the  SUCs  themselves  is  in  one  of  the  supported  languages,  Ada,  FORTRAN  or  C. 
Input  data  dependencies  are  modeled  as  input  parameters.  CXitput  data  dependencies  can  be 
oufut  parameters  in  Ada,  or  read- write  parameters  in  FORTRAN  and  C.  Mutual  exclusicm 
uses  Ada  tasks  or  global  variables  synctronized  by  appropriate  language  extensions  for  the 
supported  target  For  FORTRAN,  one  possible  infor^  standard  is  die  Schedule  runtime 
intt^ace  as  initially  done  for  the  Encore  Multimax  at  Argonne  Labs.  The  Cray  and  IBM 
extensions  use  the  multitasking  supported  by  the  vendors'  FORTRAN-77  compilers.  Any 
micro-patallelism  features  can  be  u^  by  programming  them  into  the  code  for  die  SUCs. 
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The  main  shortcoming  of  this  approach  is  that  peculiarities  oi  the  target  language  and 
architecture  are  exaggerated.  The  lack  of  portability  in  FORTRAN  is  made  explicit  in  the 
need  for  a  different  TOAD  for  every  cominnation  of  target  and  operating  system.  The 
optimization  problems  of  Ada  tasking  makes  synchronization  very  expensive.  Each  node 
depending  on  exclusive  access  to  any  data  is  modeled  as  a  separate  task.  This  situation 
should  be  crnisiderably  irtqtroved  by  using  Ada  9X  protected  operations.  Exclusive  access 
to  shared  data  can  then  be  implemented  with  protected  operations.  CODE  should  be 
changed  to  support  this  improvement  if  Ada  9X  is  supported. 

CODE'S  separation  of  target  environments  rqnesents  an  in^xirtant  research  effort,  but 
its  real  application  potential  is  limited.  The  main  reservations  CODE  developers  express  are 
that  it  maybe  difficult  to  model  architectures  significandy  different  firom  dieir  initial  set  In 
particular,  fine  grain  parallelism  is  supported  by  the  users  knowledge  of  die  problem  and  the 
compiler  for  die  chosen  target  language.  The  decomposition  of  the  algorithm  as  done  by 
transformations  or  by  other  preprocessor  approaches  is  done  entirely  when  specifying  the 
schedulable  units.  In  some  ways,  diis  is  id^  foe  developing  programmer  intuition  and 
skills,  since  die  approach  provides  rapid  ta^et  specific  poformance  information.  The  user 
can  decompose  Ms  problem  according  to  different  arcMtectural  considerations,  and  execute 
the  models  on  the  actual  hardware  using  the  supported  compilers.  What  cannot  be  easily 
doat  is  to  record  die  process  of  generating  the  different  variants.  What  would  be  needed  is 
support  for  experimentatiem  in  expressing  the  algorithm  using  different  mappings,  to 
complement  CODE'S  capability  providing  feedback  on  die  effects  of  architectural 
deamposition  for  a  particular  algoridtmic  fomnilation. 

The  developer's  of  CODE  are  about  to  announce  a  new  versimi  diat  addresses  some  of 
these  issues.  They  have  unified  the  concepts  related  to  specifying  synchronization 
expressions,  into  classes  of  operations  diat  permit  more  effective  translation.  TMs  has  the 
effect  of  decoupling  transformations  diat  potentially  interfere  for  very  goieral  data 
dependencies.  In  effect,  they  ate  supporting  the  derivation  of  specific  categories  of 
synchronization  operations  fiom  die  more  general  relations  of  SUCs  (now  called  just  UCs) 
allowed  in  the  first  version  of  CODE.  This  should  tend  to  support  exactly  the  kind  of 
successive  refinement  (under  deterministic  transformations)  that  is  needed  to  address  the 
variant  issue  raised  above.  Presumably,  this  involves  improvements  in  ROPE  to  record 
related  configuratitxi  items.  Ihis  seems  a  likely  direction  that  hopefully  will  be  pursued  in 
the  future. 


Express 

Express  is  a  commercial  parallel  programming  environment  designed  to  address  basic 
development  issues.  It  includes  tools  for  programming,  debugging  and  performance 
monitoring  of  distributed  hardware  platforms.  We  considered  EXPRESS  as  a  substitute  for 
the  Meiko  CS  Tools  that  come  with  die  Transputer,  but  rejected  it  since  it  forces  die 
transputers  into  a  hyper  cube  configuration.  Ifowever,  EXE^IESS  is  available  on  several 
platforms  in  addition  to  the  transputer,  including  a  network  of  workstations  which  may 
make  useful  in  some  applications  for  the  PEEP. 
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Faust 


Faust  is  a  modem  environment  for  supercon^uters  devel(^)ed  at  the  Center  for 
Supercomputing  Research  and  Development,  at  the  University  of  Illinois,  in  Champagne- 
Uibana.  Supporting  Butterfly,  Cedar,  and  Ardent  targets,  Faust  is  basically  a  supercon^uter 
resident  environment,  but  is  "conventional  UNIX  compatible”,  and  can  be  run  on  Sun 
workstations.  It  has  functions  for  user  interface,  grtqph  man4>ulation,  and  text  processing.  It 
provides  a  high-level  grt^hical  view  of  function  and  taslc  relationships.  Its  multi-language 
editor  supports  FORTRAN  and  C  with  various  extensions,  as  well  as  a  Pascal-based 
fuiKtional  language.  It  includes  a  graph  browser  and  a  dynamic  parallel  program  nxrnitoring 
tool.  All  its  graphics  use  X  for  portability. 

Faust  seems  to  be  the  most  immediately  usable  of  the  current  enviroiunents  for 
parallel  programming  in  conventional  langur^es.  In  particular,  its  pittgram  database  and 
intelligent  editor  cspdtilities  are  well-integrated  with  the  architecture-specific  optimizatirxi 
capabilities  needed  to  produce  executable  paralld  code.  Its  editor,  called  Sigma,  performs 
vectorizing  optimizatkxis  based  on  knowledge  of  die  tar^  architecture,  tqrplying  program 
transformations  under  user  control.  Optimizations  have  tire  results  of  dynamic  analysis 
available,  using  information  fed  back  fimn  monitoring  during  execution. 

Faust  inserts  program  monitoring  in  a  systematic  way  as  part  of  its  preparation  (rf  a 
parallel  program  for  a  given  architecture.  The  information  collected  becomes  part  of  the 
application  program  database  as  either  performance  statistics  or  da>a  dependence 
information.  The  dynamic  flow  and  dam  usage  information  can  be  used  by  optimizing  tools 
for  global  flow  analysis  and  for  determiiung  when  particular  program  transformations  can 
be  ^plied  during  architecture  mapping.  Also,  Faust  provides  a  dynamic  call  graph  animated 
view  of  a  program  execution  that  di^Iays  tiie  dyruunic  execution  of  a  parallel  program  in 
terms  of  its  original  subroutine  call  grq)h.  Additions  support  integramd  interactive  display. 
Farast's  Integrated  Multiinocessor  Performance  Analyris  and  Characterization  tool  set, 
IMPACT,  integrates  performance  data  o^ection  with  display  and  analysis  tools.  IMPACTS 
evoit-display  tool  traces  multitasidng  events  and  di^lays  them  in  real  tirr». 

The  Faust  project  appears  to  have  minimal  interest  in  exploratkm  of  the  programming 
language  problem.  Eiiqrhasis  is  on  extensirais  of  FORTRAN  and  C  (including  concurrent 
C  and  C++),  for  its  target  architectures.  Among  the  parallel  environments,  Faust  definitely 
provides  tiie  most  advanced  support  for  vectorizing  qnimization.  However,  Faust  has  little 
capability  to  deal  with  algorithms  that  are  parallel  at  a  very  high  level,  such  as  data  flow  and 
parallel  gr^h  algoritiuns.  It  concentrates  ot  solidly  supporting  programmability  in 
ccMiventional  languages  for  a  few  related  architectures,  but  provides  no  basis  for 
experimentation  with  different  parallel  programming  languages  or  methodologies. 


ISSOS 

ISSOS  is  a  programming  envirorunent  developed  at  (%io  State  University  that 
supports  prototyping  <rf  parallel  applications.  It  also  supports  experimentation  that  is 
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directed  towards  improving  the  performance  of  parallel  algorithms  executing  on  the  a 
network  of  UNIX  machines. 


The  system's  Program  Construction  System  (PCS)  uses  an  object  oriented  dialect  of 
C  called  COOL  to  prototype  ihc  application.  The  PCS  is  equipped  with  a  syntax  directed 
editor  and  a  CXX^L  to  C  source  code  generate  for  eventual  execution  in  a  UNIX 
environment  diat  supports  mechanisms  fev  process  creation,  scheduling  and  inter-process 
communication. 

ISSOS  tools  help  the  programmer  cream  new  "program  adaptations"  that  differ  from 
the  original  widi  respect  to  performance.  Static  adaptations  are  iiiq)lemented  by  including 
"Program  Adaptation"  statements  in  the  program.  These  statements  control  load  balancing 
and  can  effect  die  execution  of  a  process.  In  addition  to  adaptation  statements,  the 
Adaptation  Ccmtroller  (AQ  is  aware  of  die  physical  characteristics  of  die  target  and  can 
modify  the  allocation  of  resources  to  the  running  program  acct»dingly. 

The  AC  also  does  dynamic  tuning  by  interpreting  data  about  program  behavior 
collected  by  Program  Monitors  which  exist  on  nodes. 

ISSOS  is  still  very  much  a  research  efft^  with  even  its  language  (COOL)  in  some 
process  of  evolution.  That,  combined  with  its  en^hasis  on  large  grained  parallelism,  limits 
its  applicability  to  the  PEEP. 


Omega/PegaSys 

Omega  was  developed  at  Stanford  UniverriQr  in  die  early  80’s  as  a  non-linguistic 
programming  environment  It  supported  input  of  programs  in  mixed  form,  including 
traditional  algebraic  form  for  expressions,  graphs  for  control  structure,  and  icon 
representations  for  program  structures.  It  combined  diis  with  a  relational  program  database 
that  supports  development  through  successive  refinement  by  providing  relations  as 
sequential  interpretations. 

Omega  pioneered,  and  PegaSys  continues  the  development  of,  the  use  of  "software 
throu^  pictures"  and  very  high  level  non-traditional  programming  technique.  In  PegaSys, 
the  user  describes  how  a  program  is  built  using  a  hierarchically  structured  collection  of 
pictures  called  formal  dependency  diagrams  (FDDs).  PegaSys  is  in^lemented  in  Inteiiisp- 
D  and  runs  on  Xerox  1  KXl-series  personal  computers,  and  the  only  language  cutiendy 
supported  is  Ada. 

There  ate  three  majOT  components  included  in  PegaSys: 

•  hitetface  Tools 

•  (fieraichy  Manager 

•  Program  Verifier 

The  interface  tools  include  a  di^lay  management  package  and  also  a  package  that 
maps  graphical  q)eration  into  logical  operations.  Also  included  with  these  tools  are 
graphical  and  textual  structure-oriented  editors,  a  pretty  printer,  and  a  database  which 
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maintains  FDDs  and  programs.  The  hierarchy  manager  ensures  that  each  level  of  an  FDD 
in:4)lements  die  levels  above  it  correcdy  and  also  aids  in  the  construction  of  the  levels  of  an 
FDD.  The  program  verifier  determines  whedwr  an  FDD  is  logically  ctmsistent  with  the  Ada 
program  it  describes. 


PARET 

Developed  at  AT&T  Bell  Laboratories,  PARET  is  a  system  desoipdtm  and  simulation 
language  oriented  to  architecture  design  and  investigatitXL  It  supports  grtydiical  layout  of 
processors  and  division  of  parallel  functions  among  processing  elements  (PEs),  switching 
elemoits  (SEs),  and  communication  elements  (CEs)  (in  networks  of  CPU’s).  It  primarily 
simulates  machines  with  a  MIMD  architecture  without  shared  memory.  It  supports 
graphical  layout  of  these  elements  widi  arcs  representing  data  and  contrtd  flow,  and  allows 
association  of  capacities,  buffers  and  timing  delays  with  their  inteicoimections.  During 
simulation,  information  can  be  passed  to  analyzers  of  time  dependent  sequences  to  generate 
control  traces  and  calibrated  dynamic  utilization  displays  call^  meters.  Computation  witiun 
a  processor  can  be  simulated  by  a  delay,  or  the  conputation  can  actually  be  performed  if 
written  in  C  or  C++. 

The  PARET  user  interface  provides  a  fixed  screen  configuration  tiiat  includes 
windows  for  the  display  of: 

•  the  active  parts  of  the  system  or  subsj^tem  graph 

•  the  set  up  of  simulation  parameters  and  flow  graph  database 

•  the  graphs  or  meters  of  significant  activities 

Nodes  and  meters  provide  significant  measures  of  system  performance  and,  typically, 
are  used  to  determine  if  a  given  design  is  csq)able  of  satisfying  a  set  of  design  goals.  All 
nodes  and  meters  assigned  to  one  processor  are  shown  in  the  same  color. 

This  is  a  good  exan^le  of  a  system  simulation  ai^noach  that  would  be  needed  for 
realistically  modeling  a  wide  varies  of  parallel  con^uters.  However,  it  is  different  from  the 
above  work  on  parallel  programming  because  it  concentrates  on  hardware  description  and 
provides  no  support  for  advanced  software  development  methods. 


PIE 


The  Programnting  and  Instrumentation  Environment  (PIE)  is  a  programming 
environment,  developed  at  Carnegie  Mellon  University,  that  pioneered  the  iq)plication  of 
environment  concepts  to  parallel  program  analysis  and  development  PIE  is  composed  of  a 
Modular  Programming  (I^  metalanguage,  a  program  constructor,  and  an  inq>lementation 
assistant  The  MP  metalanguage  provides  support  for  the  efficient  manipulation  of  parallel 
modules  and  fast  parallel  access  to  shared  dt^  constructs.  The  program  constructor  (PCT) 
provides  the  capability  to  generate  efficient  parallel  programs.  The  PCT  is  itself  composed 
of  three  elements,  a  h^-oriented  editor  (MTOE),  a  status  and  performance  monitor,  and  a 


-67- 


reladonal  represratation  system.  The  implementation  assistant  provides  semantic  support  in 
the  parallel  progiam  develr^niient  cycle. 

The  language  and  c(»cq)ts  of  PIE  are  very  good,  but  are  somewhat  specialized  in  their 
suppmt  of  a  particular  model  of  parallelism  (shared  memory  multi-processing  targets  with 
m^um  to  h^e  grain  parallelism).  PIE  is  a  good  example  of  an  early  parallel  programming 
system,  however,  die  PEEP  requires  a  more  general  ability  to  model  oAer  parallel 
methodologies. 


POKER 

POKER  is  a  parallel  programming  oivironment  develqied  at  die  University  of 
Washington  that  uses  a  visual  representadon  ro  describe  parallel  programs.  Poker  assumes 
an  underlying  message  passing  non-shared  memory  architecture. 

Programs  are  constructed  creating  and  manipulating  visual  or  textual 
representations  of  informati<»  about  the  program  called  "views".  The  heart  of  a  jnogram  is 
its  communication  structure  v^iich  is  represented  by  a  communicating  graph.  Nodes 
represent  processes.  Edges  describe  communication  to  ports  on  other  nod^.  Pmts  are  also 
represented  graphically  as  a  view.  The  process  views  are  defined  textually  in  "Poker  C"  and 
labeled  to  coTTKpond  to  vertices  in  the  graph.  "Poker  C"  is  similar  to  regular  C  except  it  has 
an  additional  construct  to  describe  inter-process  communication.  There  are  also  VO  views 
that  associate  dangling  edges  with  input  or  ouqmt  streams. 

A  Poker  program's  "views"  plus  the  source  code  for  the  processes  are  stored  in  a 
program  database  which  is  compiled  and  can  be  run  on  a  simulator.  The  simulator  is 
equipped  with  a  'Trace  view"  that  includes  a  timing  model  Trace  view  lets  die  programmer 
stop  the  program,  single  stqi  tiuough  it  and  monitor  traced  variables.  Polmr  programs  also 
run  on  several  parallel  conputers  including  the  Pringle  parallel  conputer  and  the  CalTech 
Cosmic  Cube. 

Poker  has  been  used  by  over  SO  institutions  fcv  5  years  and  a  large  number  of  parallel 
programs  have  been  written  with  it  This  eiqierience  confirms  the  notion  that  visual  support 
for  parallel  programmers  inproves  productivity.  This  approach,  as  inplemented,  is  liinited 
because  lar^  programs  are  difficult  to  input  graphically,  and  are  confiising  to  view.  A  user 
is  forced  to  construct  large  programs  from  smaller  building  blocks.  Consequently  Poker 
supports  a  paradigm  for  algorithm  construction  based  on  decomposition.  Pdcer  users 
^pically  decompose  physical  problems  into  a  group  of  independent  parallel  programs  that 
are  each  defined  1^  a  communication  graph  and  its  associated  views.  These  programs  are 
called  "phases",  and  a  group  of  them  are  executed  to  solve  a  complex  problem.  This 
approach  encourages  modulariQr,  reusability,  and  graphical  representation  program 
components.  However,  die  synchronization  between  phase  executions  is  inefficient  and 
there  is  no  mechanism  to  build  single  executable  pplications.  The  programmer  is  forced  to 
run  phases  one  after  another. 

A  new  parallel  environment  called  Orca  is  being  planned  by  the  Poker  researchers  to 
correct  these  deficiencies. 
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Prometheus 


Promedieus  is  a  UNIX-based  environment  originally  developed  at  Virginia 
Polytechnic  Institute  and  State  University,  in  Blacksburg.  It  has  a  Backus  FP  "combining 
forms"  functional  language,  provides  three  different  ways  of  constracting  programs  (top- 
down  by  graphical  editor,  bottom-up  by  language-sensitive  editor,  and  iteratively  by  graph 
transformatitm),  and  executes  1^  an  extended  Boume-shell  with  a  graphical  debugger.  It 
includes  a  Tool  Description  Language  for  functional  extension. 

For  dynamic  analysis,  Prometheus  provides  four  views  of  dynamic  behavior  road 
ma^,  activity  execution  graphs,  invocation  trees,  and  frame  usage.  These  can  use  explicit 
program  points,  or  can  make  use  of  probes  inserted  into  the  interpreted  code  in  a  controlled 
way  to  monitor  block  entry  and  exit  These  probes  are  maintained  in  a  sensor-enable  table 
that  is  checked  each  time  a  sensor  is  encountered.  There  is  a  standard  fmmat  for 
transferring  diis  information  to  the  program  database  for  post-mmtem  analysis.  Probe 
insertion  and  enabling  are  bodr  under  user  control,  to  keep  intrusion  to  a  minimum. 

This  is  a  powerful  enviraunent  which  could  be  a  framework  for  program  analysis, 
including  investigation  ci  parallelisrrL  However,  it  includes  no  direct  support  for  "very  hi^ 
level  languages"  and  would  require  extensive  use  of  its  extension  mechanism  to  model  a 
variety  of  parallelism  mediodologies. 


PSG 


Developed  at  the  Technical  University  of  Darmstad,  the  Program  System  (jenerator 
(PSG)  is  protably  the  most  thorough  integration  of  programming  language  definition  and 
verification  techi^ues  with  environment  technology.  PSG  generates  language  sensitive 
editors,  an  interixeter,  and  a  fragment  library  system  using  formal  language  specificaticms 
for  any  user-defined  language. 

The  main  component  of  PSG  is  a  full-screen  editor  which  permits  both  text  and 
smicture  editing.  Prevention  of  both  syntactic  and  semantic  errors  are  guaranteed  when 
using  the  editor  in  structure  mode.  In  text  mode,  the  editor  guarantees  die  immediate 
recognition  of  tiiese  errors.  PSG  has  beoi  used  to  generate  environments  for  ALGOL-60, 
Pascal,  MODULA-2,  and  die  formal  language  definition  language  itself.  The  main  interest 
in  PSG  for  the  rcEP  is  its  use  of  common  type  definitions  in  die  tools  genoration  process 
to  guarantee  intenqierability  of  tools. 


HPDE 

This  is  a  software  process  project  at  IBM  Yorktown  Heights  that  supports  an  object 
oriented  tool  framework  for  integration  of  program  analysis  and  processing  techniques. 
RPDE  has  been  used  successfully  in: 
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•  extending  tool  functionality, 

•  extending  data  donaain  (Stools, 

•  sui^xxrting  new  languages, 

•  integrating  separate  environments, 

•  supporting  new/different  operating  systems. 

RPDE  emphasizes  common  services  along  tools,  including  mappings  from  internal 
tool  data  representation  to  common  interoperable  formats  tiiat  are  a  kind  of  temporary 
common  file  system.  Tte  initial  efforts  have  shown  tiiat  the  common  service  firt^woik 
covers  almost  all  of  the  needed  services  for  new  processors. 

In  their  article  "Support  for  Change  in  RPDE",  H.  Ossher  and  W.  Harrison  wrote: 


The  research  issues  being  pursued  in  connection  witii  Garden  are  similar 
ro  those  of  interest  to  us  in  building  and  using  RDPE.  Both  systems 
ei^loy  a  substantial  functional  frairowoik  of  services,  witii  an  object- 
oiientra  definition  of  conceptual  language  structure  mi  top.  The 
developers  of  Garden  have  eirployed  it  to  explore  gtaphk^ 
programming,  tiie  creation  o£  user  concept  structures,  and  visual  output 
We  have  en^loyed  RPDE  to  explore  ways  of  exploiting  the  structure  of 
professionally  (fcveloped  software  to  solve  in-tiie-large  programming 
problems.  We  have  also  devised  and  demonstrated  tte  usefulness  of 
some  extensions  to  the  object-oriented  paradigm. 

These  are  similar  to  the  goals  of  the  PEEP  and  indicate  possible  directions  to  pursue. 


TOPSYS 

TOPSYS  is  an  integrated  parallel  programming  environment  that  addresses 
monitoring  on  distributed  platforms  using  a  nundber  of  cooperating  techniques  and  tods.  It 
is  currmtly  mily  available  for  the  iPSC/2;  however,  if  the  Sun  workstation  version  cuirently 
being  developed  becmiKs  availattie,  it  should  be  evaluated  as  a  possible  stand  alone  special 
purpose  development  vehicle  for  the  PEEP. 


FORTRAN  Parallelism  Support 


FORCE 

FORCE  is  an  extension  of  FORTRAN  for  parallel  programming  of  scientific 
ai^licatimis  on  shared-memory  multicomputers.  It  is  currently  not  a  supported  tool, 
howevo-,  it  is  available  "as  is"  to  interested  people.  It  is  based  on  tiie  UNIX  tools  sed  and 
m4,  and  there  are  versions  for  Encore,  Sequent,  Alliant  and  Cray  machines.  It  involves 
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parallel  omtiol  structures  which  are  expanded  by  target  dependent  macro's  to  architecture 
specific  miciotasking  calls. 


PAT 


Developed  at  Georgia  Tech,  PAT  (Parallelizing  Assistant  Tool)  is  a  menu  driven  editor 
for  analyzing  and  incrementally  transforming  FORTRAN  programs  to  achieve  greater 

of  availflhlft  parallel  optimizations.  It  analyzes  die  program  for  specific  features, 
including  ctmmxMi  data  usage  and  looping  behavior  and  allows  the  user  to  change  the 
program  under  specific  conditions.  Some  of  the  transformations  can  even  be  dcxie 
automatically,  and  then  subjected  to  user  scrutiny  for  possible  improvement  or  partial 
reversal  This  effort  shows  the  kind  of  user  engineering  that  can  be  expected  to  play  an 
increasing  role  in  parallel  programming  support  efforts.  It  can  draw  on  compiler-related 
efforts  that  are  ba^cally  optimization  projects,  such  as  tiie  project  at  Rice  and  several 

parallel  compiling  projects  of  hardware  voKlors,  and  on  die  program  transftxmatioa  work  of 
such  projects  as  Meta-Crystal  at  Yale  and  E-L  at  Harvard.  PATs  use  of  an  intelligent  editor 
to  control  a  semi-automatic  program  refinement  process  draws  on  work  such  as  Interlisp 
and  the  relationship  of  EMACS  and  LISP  diat  has  a  long  history  at  MIT. 

The  model  used  by  PAT  has  some  inopcatant  cmtral  properties: 

1.  subroutines  are  analyzed  one  at  a  time,  with  information  fiom  nested  calls  used  as 
availabie; 

2.  loops  with  no  dependencies  are  identified  for  optimization  first; 

3.  loops  with  dependencies  are  analyzed  incrementaliy  under  intelligent  editor  control 
using  semi-automatic  refinemoit; 

4.  line  numbers  fixxn  the  original  source  are  preserved  throughout  die  inqirovement 
process  to  allow  maximal  use  of  programmer  knowled^. 

PAT  supports  three  levels  of  interactimi: 

Manual  Mode  -  where  die  user  has  txin^ilete  control  over  transformations. 

Power  Steering  Mode  -  where  die  tool  makes  suggestions  using  available  information 
and  follows  user  directions. 

Auto  Pilot  Mode  -  where  die  tool  will  perform  sequences  of  transfomiations 
indqiendendy. 

The  last  mode  is  still  a  research  area,  but  activity  appears  to  be  progressing.  The 
program  regions  analyzed  by  the  PAT  under  user  control  are: 

•  subprograms, 

•  loops, 

•  loops  with  governing  if  conditiems, 

•  parallel  regions  (TBD). 

An  irr^iortant  aspea  of  subprogram  aiialysis  is  control  over  die  amount  of  called 
program  information.  Basically  complete  analysis  is  needed  to  handle  all  possible 
dependencies.  However,  this  is  very  expensive  and  may  normally  not  provide  much 
infonnation.  The  PAT  solution  will  be  to  allow  the  user  to  decide  how  much  of  the  call 
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graph  to  use  ai.  Full  subprogram  inlining  is  possible  in  the  limit  The  user  may 
control  how  mudi  of  the  call  gn^h  is  included  for  analysis  and  make  use  of  the  multiple 
window  browsing  to  coordinate  which  regions  to  be  used.  In  this  static  analysis  browsing 
mode,  none  oi  the  program  transftnmation  acticms  may  be  invoked. 

Program  transformations  are  under  control  of  a  "sub-menu".  Actions  include  adding 
to  and  deleting  from  source  and  moving  sections  of  code.  Dependencies  can  be  affected  by 
adding  protection,  adding  alignmoit  for  loop  indices,  and  lepUcating  variables  and 
asMgmnents.  Some  of  diis  can  be  automated,  since  if  a  depencfence  graph  is  a  tree,  all  the 
dependencies  may  be  removed  subscript  alignment  This  should  be  the  kind  of 
q>timizati(xi  that  will  increasingly  be  done  automatically  as  PAT  is  improved.  More  globally 
optimal  decisicms  such  as  process  paitidoning  should  continue  to  be  done  under 
programmer  control. 

More  ambitious  optimization  is  needed  for  real  architectural  tailoring.  A  more  flexible 
approach  would  use  PAT  techniques  to  examine  existing  FORTRAN  programs  for 
opportunities  for  annotating  parallel  computations.  A  second  phase  similar  to  the  FORCE 
tool  for  programming  portable  FORTRAN  with  parallelizable  loc^s  would  then  be  needed. 


Pisces 

Developed  at  the  Urdversi^  of  Virginia,  Parallel  Inq)lementation  of  Scientific 
Oxiqmter  Bivironments  (PISCES)  is  a  FORTRAN-based  environment  with  concrete 
language  extensions  for  tasking  and  parallel  blocks.  It  runs  on  a  single  processor  VAX 
using  UNIX  processes  for  parallel  execution,  or  a  network  of  Apollo  workstations.  It 
depends  on  compiler-based  analysis  for  higher-level  program  views  and  requires  interpreter 
ad^timis  for  desding  with  different  architectures.  The  main  interest  in  Pisces  is  in  the  form 
of  its  task  declarations,  which  are  "clusters"  having  intrinsic  properties  (including 
"controllers",  "subtasks",  and  message  handlers).  Its  parallel  blocks  are  concurrent  C-style 
constructs.  This  project  provides  little  basis  for  experimenting  with  either  parallel  languages 
or  hardware.  Its  FORTRAN  extension  would  pro^e  a  possible  "interme^te  language"  to 
which  veiy  high  level  languages  could  be  translated.  Th^  programs  in  this  language  would 
have  to  be  translated  further  to  lower  level  FORTRAN  as  suppcmed  by  standard  compile. 
However,  the  sequoitial  FORTRAN  produced  by  die  PISCTES  preprocessor  may  not  be 
easily  parallelized  by  anodier  machine's  paralleling  conpiler. 


This  is  Rice  University's  FORTRAN  environment  for  parallel  architectures.  It  has  the 
most  advanced  compiler-ba^  analysis  of  the  environments  surveyed.  It  utilizes  most  of  the 
known  static  analysis  and  optimization  techniques  and  performs  vectorizing  optimizations 
for  most  available  parallel  architectures  in  the  context  of  a  conqilete  optimizer  and  historical 
program  database.  R'*  is  still  a  standard  for  supercomputer  FORTRAN  language 
environments.  The  de  /elopers  have  deliberately  avmd^  ever  becoming  multi-language  or 
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including  any  hi^er  level  fonnalisms  in  the  source.  All  program  development  is  dtxie  using 
a  program  database  and  associated  analysis  tools.  The  latest  extensions  support  using 
execution  information  to  irr^rrove  optimization.  By  design,  this  effort  provides  no  supptxt 
for  high  level  parallelism,  and  so,  is  primarily  useful  in  the  PEEP  effort  for  its  ideas. 


Schedule 

Developed  at  Argotuie  National  Laboratories,  to  provide  portability  of  scientific 
parallel  applications.  Schedule  is  a  FORTRAN  library  of  parallel  support  routines  which 
allow  application  programs  to  perform  synchrortization  and  scheduling.  It  runs  on  VAX, 
Cray  arxl  various  multicorr^uter  hardware.  It  provides  a  way  of  dealing  with  limited 
operating  system  and  architecture  variatitms  as  a  procedural  extension  to  FORTRAN. 


Special  Purpose  Tools 


BAISA-n 

B  ALSA'II  is  a  program  annotation  and  monitming  system  under  development  at  DEC 
System  Research.  It  is  an  outgrowtii  of  the  BALSA  program  originally  devel(^)ed  at  Brown 
University.  It  involves  programmer  specification  of  significant  program  points,  and  use  of 
programmed  tracing  routines  to  cause  system  monitoring  and  data  collection  at  run  time. 

The  data  obtained  is  used  to  monitor  execution  in  whatever  is  determined  to  be  the  most 
critical  or  significant  way.  Typically,  this  invdves  data  usage  or  execution  frequency 
information  that  can  be  used  to  significantly  improve  algoritiun  performance. 

B  ALS  A-n  provides  a  graphical  interface  which  allows  the  user  to  vary  tiie  point  of 
view  under  control  of  a  mouse.  It  communicates  with  the  program  viewer  to  display  the 
sectitm  of  the  program  grq)h  affected.  The  alnlity  to  view  results  is  important  fcv  providing 
teal  data  to  gi^  performance  improvanents.  It  is  not  specifically  related  to  parallelism,  but 
could  be  used  to  explore  algotithins  executing  in  parallel  by  assigning  monitoring  elements 
to  individual  processing  units.  The  importance  of  its  features  cannot  be  oveterr^hasized  in 
trying  to  characterize  the  behavior  of  a  parallel  system. 

Trace  data  is  often  very  difficult  to  interpret  mearungi'ully,  and  often  there  is  a  need  to 
relate  behavior  at  different  program  levels.  BALS A-II  provides  titis  kind  of  control,  and 
includes  a  user  interface  that  enables  dynamic  selection  of  monitored  events  and  traces.  The 
user  can  have  the  effect  of  "zooming"  and  "unzootning"  to  get  a  valid  feel  for  how  much 
activity  is  occurring  in  particular  subsystems  and  subprocesses. 

BALS  A-n  is  a  methodology  for  annotating  and  interpreting  programs  of  any  kind  and 
does  not  depend  on  the  language  or  the  kind  of  probleriL  Iititially,  it  makes  use  of  an 
extended  P^al,  but  it  can  support  any  language.  It  makes  good  "meaningful  use"  of  color 
in  its  graphics,  shows  the  possibilities  of  graphics  in  program  monitoring,  and  suggests  the 
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inqportance  of  user-supplied  high-level  structuring  informaticni  for  programs  in  any 
iiiq)lementati(m  language. 

Since  the  Garden  system  has  taken  many  of  the  same  ccmcepts  and  integrated  diem 
into  it's  muld-lingual  graphics  front  end,  th^  concepts  are  available  in  either  system. 


DaUk 

Dalek  is  a  debugger  develqied  at  the  University  ttf  California  at  Davis  to  run  on  top  of 
traditional  debuggers  such  as  sdb.  It  is  primarily  an  event  oriented  debugging  Ir  nguage 
useful  in  any  dynamic  program  analysis,  inclut^g  tradititmal  testing.  It  allows  logical 
qierations,  queries,  conditional  execution-time  actions  on  events,  and  mcMiitoring  and 
profiling  under  user  control.  It  can  be  used  like  con^iler  and  linker-inserted  probes  fcH*  run¬ 
time  monitming  and  can  perform  dataflow  analysis  of  event  relationships  like  intelligent 
debuggers.  It  relies  on  user-supplied  event  definitions  and  can  be  used  with  any  language. 
Dalek  uses  the  techniques  of  tradidtxial  debuggers,  but  is  more  powerful  in  that  its  event 
language  is  a  kind  of  monitoring  meta-language.  Its  role  as  a  sequential  debuggingAesting 
tool  could  be  adapted  to  support  parallel  program  tracing  and  performance  analysis. 


E-L 


E-L  (for  Lnvironment  and  Language)  is  a  transfonnational  firameworic  develqxd  at 
Harvard  and  supported  by  Software  Options,  Inc.  for  refining  programs  using  a  variety  of 
software  methodologies  and  for  compiling  to  many  targets.  E-L  supports: 

•  a  flexible  linguistic  medium  to  allow  the  user  to  specify  the  domain  of  discourse 

•  an  open-ended  tool  set  to  allow  existing  and  new  tools  to  interact  easily 

•  a  base  of  operations  and  principles  of  extension  for  manipulating  programs 

While  based  on  principles  of  language  extension,  the  E-L  approach  is  different  from 
standard  language  coo^iling  in  that  ixograms  are  manipulated  by  tools  whose  base 
transformatitms  are  modified  to  produce  interinetable  E-L  at  successive  levels  of  semantic 
refinement  The  final  level  is  a  program  using  primitives  close  enough  to  the  desired 
hardware  operations  that  standard  code  generation  techiuques  can  be  used  to  ccxnpile  to 
executable  programs. 

E-L  has  been  successfully  used  to  naodel  a  UNITY  con^iling  process  that  provides 
UNITY  syntax  and  semantics,  as  well  as  a  facility  for  structuring  UNITY  programs  to  allow 
checking  of  meatungful  program  composition. 


FIELD 

The  Friendly  Integrated  Environment  for  Learning  aixl  Development  (FIELD)  is  a 
interactive,  UNIX  workstation  based  programming  environment  that  runs  on  top  of  X- 
windows.  Devel<^)ed  at  Brown  University,  it  provides  a  variety  of  tocds  for  program  and 
data  visualization  and  offers  a  consistent  and  integrated  graphical  interface.  FIELD'S  main 
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feature  is  a  central,  selective  broadcasting  mechanism  that  lets  new  tools  be  integrated  into 
die  system  with  litde  effort  in  the  future. 

The  tools  currently  integrated  into  the  FIELD  environment  include  a  syntax  direaed 
editOT,  debugger,  cross  referencer,  data  structure  displayer.  Make  interface,  profiler,  and 
general  system  viewer. 

GARDEN 

Develt^ied  at  Brown  University,  GARDEN  is  a  "visual  programming  environment” 
that  has  been  used  to  model  a  wide  variety  of  parallel  programming  methodologies.  It 
supports: 

•  rapid  implementation  of  design  languages  to  facilitate  "conceptual  programming" 

•  object-oriented  program  graphs 

•  textual  interface  with  lisp-like  Functional  Programming 

•  primiti  ves  for  object  pn^ierties  including  concurrency 

•  translation  to  C  to  achieve  compiled-code  performance  of  the  simulation 

GARDEN  has  been  used  to  model  data  flow  systems,  finite  state  path  languages,  Petri 
nets,  a  CSP-like  ports  language,  Unda  and  a  Multilisp  system.  Extensive  engineering  of 
user  interfaces  has  been  done  to  support  nxxiitoring.  There  is  a  capability  for  supporting 
many  windows,  monitoring  specifically  programmed  events,  and  the  interpreter  allows  any 
desired  arrangement  of  processes  and  processors  to  be  monitored  at  varied  levels  of 
descripticxi. 

like  many  of  the  parallel  environments,  GARDEN  supports  instrumentation  either 
automatically,  by  inserting  non-intrusive  probes  at  sigruficant  points,  or  by  programming. 
This  data  can  be  used  with  a  program  viewer  for  value  tracing,  or  to  naonitor  significant 
activities,  or  can  be  analyzed  to  generate  various  kinds  of  control  traces  or  dynamic 
utilization  displays.  For  object  trunitors,  GARDEN  commurucates  with  the  program  viewer 
to  display  the  section  of  the  program  graph  affected. 

As  a  programming  environment  GARDEN  is  incomplete  in  that  it  supports  neititer  the 
reading  of  source  code  for  the  modeled  languages  into  the  GARDEN  environment  nor  the 
output  of  objects  from  the  system  into  their  textual  source  format  This  problem  can  be 
solved  by  defining  methods  for  both  reading  and  writing  source  along  with  the  other  objects 
for  the  modeled  language. 


Hypertasking 

Hypertasldng  is  a  software  parallelization  tod  fnr  the  Intel  hyper  cube.  It  is  currently 
available  as  an  unsupported  tool  from  Intel  for  the  iPSCy2.  Currently  it  does  not  run  on  any 
other  parallel  hardware. 
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lPS-2 


Devel(^)ed  at  University  of  Wisconsin-Madison,  IPS-2  is  a  peifonnance  measurement 
tool  for  parallel  and  distribute  programs.  Measurements  can  be  displayed  at  different  levels 
ctf  abstractkHi  including  die  program's  behavior  as  a  whole,  that  of  processes,  procedures 
widiin  processes,  or  primitive  machine  level  activity. 

Programmers  do  not  need  to  modify  their  programs  to  use  IPS-2.  The  raw  data  is 
gotten  throu^  die  use  of  instrumentadon  probes  which  are  contained  in  the  run-dme 
system  and  operating  system  kernel.  The  user  chooses  to  use  them  via  compiler  and  linker 
opdons.  Their  ouqiut  is  kept  in  a  data  pool  and  analyzed  whoi  requests  are  made  through 
the  user  interface. 

IPS-2's  graphical  user  interface  lets  the  user  assign  executable  images  to  processes, 
processes  to  machines,  select  stadsdcs  for  display  and  begin  the  execudon  of  die  program 
interacdvely.  IPS-2  then  displays  the  requested  performance  stadsdcs.  Addidonally  the  tool 
locates  botdenecks  by  performing  Cridcal  Padi  Analysis.  IPS-2  can  also  perform  Phase 
Behavior  Analysis  which  automatically  identifies  phases  in  the  execudcxi  history  of  the 
program  so  that  they  can  be  studied  more  carefully. 

IPS-2  has  been  used  to  provide  data  for  analytical  performance  models  of  parallel 
systems,  and  to  measure  the  characteristics  of  parallel  database  join  operations,  parallel 
searches,  and  networic  flow  [ffograms.  User  finback  suggests  the  tool  is  easy  to  use  and 
useful. 


MPD 

Multi-Threaded  Debugging  is  wotk  done  at  Columbia  University  using  Mach  C 
Threads.  The  methodology  uses  global  control  flow  aiudysis  to  identify  partial  orders  of 
events,  and  to  construct  predecessor  automata.  These  automata  are  used  to  instrument 
programs  and  provide  meaningful  characterization  oi  debugging  and  trace  information.  The 
debugging  language  allows  the  user  to  assist  in  analysis  and  to  use  identified  events  in  path 
expressions  to  control  nxMiitoiing  activity  and  filter  trace  data 

This  is  a  general,  language-independent,  debugging  technique,  including  sufficient 
source  analysis  to  support  tracing  of  parallel  execution.  It  can  be  used  with  other  language 
preprocessor  and  transformation  teclmiques  to  inrovide  support  for  visualization  used  for 
eitlwT  algorithm  understanding  or  performance  analysis. 


Olympus 

This  is  an  interactive  modeling  system,  however,  its  graphical  interface  is  Sunview  not 
X.  It  is  unusual  for  people  to  still  be  using  Sunview,  so  it  might  be  worth  investigating 
whether  the  developers  have  moved  to  X  since  we  last  mllcfid  to  them. 
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PADWB 


Parallel  Algorithm  Development  Woritbench  is  a  tool  that  facilitates  the  evaluation  of 
algcffithms  by  simulatim.  The  user  models  die  consmunication  overhead,  inter-connect 
strategy,  CPU,  etc.  tt>  obtain  a  graphical  description  of  die  system's  behavior.  The  output 
shows  the  relationship  between  communication  and  CPU  usage  including  idle  times  and 
periods  of  overiapped  processing  and  communication  for  all  processors  involved. 


The  modeling  is  dectmiposed  into  three  layers:  Network,  Cluster,  and  ProcesscH-  layer. 
There  is  a  software  interpreter  that  accqits  the  code  under  test  as  input  and  a  Graphics  post¬ 
processor  that  gather’s  combined  input  from  the  other  layers  and  generates  ou^ut. 

Basically  die  algoridim  is  interpreted  by  the  "software  interpreter”  vihich  generates  an 
instruction  stream  to  the  "processor  flow  modd".  This  layer  does  instructimi  counts  and 
accounts  for  the  relative  speed  of  different  instractions.  This  stream  is,  in  turn,  fed  into  the 
"cluster  model"  which  m^els  a  group  of  a  processor’s  contention  for  shared  resources.  It  is 
here  diat  time  required  fm*  bus  access,  memory  access  and  I/O  are  accommodated. 

The  inter-connection  between  processors,  and  the  cost  cmnmunicaiion,  is  modeled 
by  a  "network  model".  This  section  is  the  heart  of  die  simulation  as  it  relates  to  parallel 
processing.  It  can  represent  various  inter-connection  methods  and  demmistrate  their 
suitaMUty  to  a  proposed  algoridim. 

All  the  models  are  process  oriented  next  event  simulations  written  in  Simula  with 
DEMOS  extensions  and  in  C.  Some  component  behaviors  are  modeled  by  stochastic 
assumptions.  Performance  is  a  concern  widi  this  type  of  simulation  especially  when  very 
detailed  models  are  used.  Another  consideration  when  assessing  the  value  of  such 
experiments  is  the  assumptions  that  are  made  with  respect  to  exogenous  evmts.  These 
assun^tions  will  to  a  large  degree,  determine  the  accuracy  of  the  results. 

Work  on  this  tool  at  Gould  is  on  going  and  as  of  1988,  the  networic  layer  was 
incomplete,  although  parallel  experiments  were  being  conducted  by  using  the  processor 
layer.  The  experiments  were  small,  involving  known  problems  (matrix  multiply)  widi  only  a 
few  processors  (2). 

PAWS 

Developed  at  Syracuse  University,  PAWS  (Parallel  Assessment  Window  System)  is 
an  integrated  X-Windows  based  toolset  for  evaluating  existing  or  hypothetical  machine 
architectures  running  a  common  applicatkm.  It  translates  source  programs  from  a  high-level 
source  language  into  the  form  of  a  ^taflow  grtph.  It  maps  diis  intermediate  form  into  an 
executaUe  characterization  using  target  machine  description  miodels  that  currently  reflect  a 
SIMD  (Thinking  Machine's  CM-2),  a  shared  memory  MIMD  (Encore  Multimax) ,  and  a 
message  passing  MIMD  (hyper  cube)  architecture.  It  supports  user-controlled  monitoring 
that  relates  dynamic  information  beck  to  the  graph  of  the  source  program. 


The  source  languages  supported  are  SISAL,  a  single  assignment  language  based  on 
stream  (^)erations,  and  Ada.  Programs  are  translated  to  a  grtq)hical  dataflow  language  IFl, 
an  intennediate  language  for  SISAL.  The  partiticxiing  of  processcnrs  and  evaluation  of 
perframance  information  all  malm  use  of  user  controlled  heuristics  and  actions  performed 
during  walking  this  graph.  The  machine  descriptions  include  processors,  memory 
characteristics  and  interconnecdon  topologies.  These  are  created  using  an  interactive  tool 
that  allows  a  conq>lete  hierarchical  descriptirm  of  die  target  hardware.  This  is  also  in  the 
form  of  a  graph,  with  a  structured  arrangement  of  information  about  processor,  memory  and 
network/connecticm  characteristics.  Query  parameters  allow  reference  to  data  from  diese 
structures.  The  query  types  for  performance  data  include  times  for  aridimetic  and  logical 
confutation,  communicadon  costs  and  network  overhead. 

An  interactive  display  tool  provides  the  user  interface  for  accessing  all  the  PAWS 
tools.  This  is  a  hierarchical  menu-driven  system,  supporting  multiple  windows  for 
programming,  interpreting  and  evaluating  parallel  algorithms.  Called  the  IGDT,  this  displays 
graphs  hierarchically,  allowing  the  user  to  select  a  node  and  expand  or  contract  die  field  of 
view  to  display  as  many  nodes  as  desired.  The  same  interface  is  used  to  describe  all  the 
application  co^,  architecture  descriptirm  and  assessment  options.  Similar  menus  support 
specificaticm  of  confiler  options  wl^  the  Ada  conf tier  is  invoked  and  selection  of 
perfcnmance  metrics  during  evaluation. 

The  normal  mode  of  using  PAWS  for  assessing  an  algorithm  is  to  tun  a  compiled 
program  on  a  theoretical  architecture  assuming  an  ideal  number  oe  arrangement  of 
processors  for  the  program  first  Then  die  program  is  tun  on  a  variety  of  architectures  more 
closely  approximating  intended  targets.  The  effective  speedup  or  other  cost  metric  can  be 
deterrnined  from  the  monitoring  data  collected  fot  each  run.  The  algorithm  can  be  debugged 
using  the  same  kind  of  facilities  on  die  ideal  machine,  or  on  adaptations  to  reveal  different 
possibilities  of  data  and  processor  usage. 

The  architecture  description  ftamework  is  just  a  graph  structure,  built  with  die  same 
tools  as  the  basic  PAWS  facilities.  This  structure  includes  relationship  to  a  predefined 
architecture  characterization  structure.  This  describes: 

•  die  number  and  flexibility  of  different  functional  units, 

•  the  number  of  processors, 

•  memory  bandwiddis  and  memory  hierarchies, 

•  die  types  interprocessor  communication  mechanisms. 

There  are  five  areas  hierarchically  partitioned  into  a  fine  enou^  detail  to  provide 
desired  granulati^  of  measurement  Time  areas  are  computation,  data  movement 
communrcation,  control  and  I/O. 

Both  parallelism  and  execution  profiles  are  generated/computed  by  walking  die 
program  dataflow  graph.  The  walk  traverses  the  graph  recording  and  evaluating  each  node's 
performance  and  statistical  data.  To  handle  compound  nodes,  the  walk  recurses,  thus 
allowing  statistics  for  individual  procedures  and  blocks  to  be  recorded  and  maintained 
indivkhiaily  and  evaluated  hierarchically.  Recursive  calls  in  die  application  ate  in^lemenied 
as  bounded  loops,  with  a  bound  based  on  estimation  or  actual  data  from  previous  runs. 
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PAWS  provides  a  way  of  modeling  almost  any  aichitectuie.  By  using  IFl  and 
peiforming  all  analysis  on  an  aibitraiy  graph,  any  level  of  computation  can  be  nKxleled  and 
dien  mapped  to  an  architecture  using  graph  pattern  matching.  Data  for  comparing 
performance  on  categories  of  architecture  could  be  obtained  by  adjusting  the  architecture 
graph,  thus  allowing  the  equivalent  of  general  simulation.  Information  on  processor  usage 
and  communication  overhead  can  be  obtained  in  a  structured  way  with  a  flexible  machine 
descriptioiL  This  facility  is  an  important  technique  for  evaluating  processor  decort^xMition 
and  interconnection  strategies. 

The  system  has  an  Architecture  Characterization  Tool  diat  lets  a  user  specify  the 
architecture.  A  machine-independent  version  of  the  application  (a  data  dependency  graph)  is 
then  generated  frcxn  a  standard  hi^-level  language  with  the  Application  Characterization 
Tool,  and  is  "executed”  (m  the  machine.  The  performance  of  the  proposed  architecture  is 
profiled  with  a  Performance  Assessment  Tool  and  a  graphic  diq>lay  is  implemented  via  an 
hiteractive  Graphical  Display  Tool  These  tocds  generate  parallelism  proves  and  ^)eedup 
information  in  an  intuitive  graphical  format 

PAWS  is  unique  in  that  it  provides  for  an  assessmoit  of  the  synergy  between  an 
application  dcxnain  and  its  architectural  platform.  Witii  these  tools  a  user  can  determine  the 
efficacy  of  a  particular  machine  organizaticm  for  his  particular  problem  set 

PProto 

PProto  (for  Parallel  Prototyping  system)  is  a  CASE  tool  with  graphical  input  oi 
parallel  systems  specified  as  fiuictioiud  elements  maiq)ed  to  parallel  architecture  models  for 
simulation  and  ev^uation.  Functional  elements  are  refined  into  intetpretable  components 
and  used  to  debug  and  measure  models  of  algorithms  on  various  architectural  m^els. 
PProto  provides: 

•  prototyping  ciq}atnlities  addressing  functional,  structural,  tuning  and  behavioral 
abstraction  issues,  allowing  multiple  versions  of  a  system  prototype  to  be  noodeled 
and  evaluated 

•  concurrent  execution  model  involving  systems  of  multiple  communicating 
processes  simulated  on  very  general  mdti-process(»:s 

•  suf^xrrt  for  top-down  incremental  develc^nnent  via  component  reuse 

•  siq>port  for  staying  software/hardware  mapping  tradet^s  throu^  identifying 
architectural  resource  impacts 

•  sophisticated  simulation  capabilities,  including  utilities  for  interpretation, 
scheduling,  resource  modeling,  instrumentation,  animation  and  debugging 

A  PProto  specification  is  a  hierarchical  data  flow  graph  consisting  of  processor 
elonents,  data  storage  units,  communication  connections  and  ports.  The  behavior  of  each 
leaf  of  a  model  is  described  with  a  simple  structured  programming  language  (SDDL 
— Structured  Data  Definition  Language)  which  is  supported  by  die  PPrmo  interpreter. 

Tools  include: 

•  graph  editor,  for  producing  data  flow  graphs, 

•  behavior  editor,  for  editing  SDDL  node  tehaviors. 
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•  schema  editor,  for  menu-based  editing  of  data  storage  units, 

•  reuse  facili^,  for  saving  conqx>nents  of  models, 

•  architecture  editor,  for  gnqph^  editing  of  hardware  topologies, 

•  nug)ping  editw,  for  specifying  software/hardware  component  mappings, 

•  prototype  ^ulatcv,  for  execution,  animation  and  debugging. 

PProto  is  a  process  description  language  and  persistent  database  tiiat  supports 
mnd^ling  and  performance  analysis  for  parallel  hardware  design.  It  uses  behavtoral 
modeling,  witii  process  models  and  a  concurrency  notation  tiiat  includes  guards  and  actions 
similar  in  spirit  to  Dijkstra’s  guarded  commands.  PProto  allows  direct  mapping  of  these 
high-level  specifications  to  target  hardware  descriptions  that  can  include  processors, 
memories  and  buses  as  basic  configuration  items.  It  can  perform  simulation  analysis  with 
mechanisms  similar  to  debugging  commands,  and  supports  user  programmed  monitoring  at 
the  model  level  It  reports  statistics  on  communications,  including  sue  buffers,  which 
enables  detectirai  of  bottlenecks. 


TANGO 

Tango  (Transition  based  ANimation  Generation)  is  an  object  oriented  algorithm 
animation  system  developed  at  Brown  that  lets  tiie  programmer  illustrate  the  ’’interesting 
evoits"  within  a  program  witiiout  having  to  write  any  low  level  graftiiics  code.  This  package 
can  be  used  with  the  FIELD  programming  oivironment  described  earlier. 


VMMF 

The  Virtual  Machine  for  MultiprocessOTs  (VMMP)  is  a  software  package  that 
provide*  a  set  of  services  to  write  parallel  ^licati(»5  for  MEMD  multqnocessors  in  a 
machine  independent  fashion.  This  is  done  abstracting  common  requirements  rtf  parallel 
algorithms  into  a  set  of  procedure  calls  whose  inq)lementation  are  hidden  fiom  the  user. 

The  author  identifies  2  paradigms  for  parallel  algoritiuns:  "tree  computations"  and 
"crowd  computations".  Tree  coo^utatitxis  are  a  strategy  which  solves  problems  by  breaking 
them  into  subproblems  and  assigning  the  subproblems  to  diild  processes.  "Crowd 
computations"  are  groups  of  processes  executing  the  same  code  on  distributed  data  and 
synchronizing  with  messages.  In  other  words  a  message  passing  based  data-parallel 
approach.  VMMP  also  ofien  a  set  of  functions  tiuu  creates  shared  objects  available  to  all 
processes.  Shared  objects  are  inovided  so  that  the  programmer  can  distribute  data,  collect 
results  etc. 

An  exan^le  ci  a  tree  cooptation  entry  points  is  Vcall  Vcall  creates  a  child  process. 
Vwait  allows  the  parent  process  to  Mock  until  its  children  tensinates. 

An  exan^le  of  a  crowd  services  is  Vcrowd.  Vcrowd  creates  some  number  (which  is 
passed  in  as  an  argument)  of  processes  all  executing  the  same  code.  Vsend  allows  the 
programmer  to  send  a  message  to  any  or  all  jxocesses  within  a  crowd.  There  are  functions 


to  q)ecify  a  topdogy  to  a  crowd  to  help  optimize  communication.  VMMP  supports  a  ring, 
torus,  m^,  n*cube  and  tree. 

VMMP  supports  associate  functions  that  operate  on  distributed  data  via  the  Vcombine 
entry  point  Associate  operations  are  used  with  shared  data  objects  which  can  be  created 
withVdcf. 

The  VMMP  hides  low  level  communication  and  synchronization  requirements  of  a 
problem  from  the  programmer.  Instead  he  is  offered  a  set  of  high  level  operations  that  are 
typically  used  in  parallel  algoritiuns.  These  parallel  ctxistructs,  however,  have  known 
communication  teid  synchronization  requirements  for  a  particular  architecture.  This  allows 
run-time  qptimizations  to  be  made  transparent  to  the  user.  Never  the  less,  only  a  good 
matiiage  b^een  hardware  and  parallel  coding  paradigm  can  insure  an  efficient  program. 
For  instance,  shared  memory  operations  entail  a  hi^  communication  overhead  on  a  MIMD 
machine  which  uses  discrete  naemcxy  for  each  processor.  A  general  disadvantage  of  this 
kind  of  approach  is  that  the  relation^p  betweoi  a  chosen  construct  and  its  efficiency  is  not 
obvious. 

The  VMMP  has  been  implemented  on  two  shared  memory  multiprocessors  (an  ACE 
and  MMX)  and  three  uni-processors,  and  is  being  implemented  on  a  distributed  memory 
multi-processor.  It  is  used  as  the  intermediate  language  for  a  portable  parallel  Pascal 
Conner. 

The  main  advantage  of  this  ^stem  is  it  portability.  It  is  apprc^nriate  far  medium  and 
large  grain  parallelism. 


Parallel  Languages 


C* 


C*  is  a  data-parallel  language  that  is  based  on  C  and  €++.  C*  was  developed  at 
Thinking  Machines  (Corporation  for  the  Ctxinectitm  Machine.  The  aggregate  data  structure 
is  the  "domain".  A  new  statement  Qrpe,  the  "select"  statement,  defines  parallel  executitm 
within  domains. 

The  sequential  code  that  each  processor  executes  in  parallel  is  standard  C  code.  The 
index  widiin  a  domain  specifies  the  virtual  processor  that  a  domain  element  resides  on. 
Incrementing  or  decrementing  a  domain  in^x  identifies  the  adjacent  processor.  This  gives 
domains  a  grid-like  shape. 

C*  is  completely  synchronous.  Virtue  processors  execute  their  instructions  as  if  the 
host  processor  were  bro^asting  the  instruction  at  all  the  processcns  in  typical  SIMD 
fashion.  MIMD  operations  can  be  executed  by  dereferencing  pointers  to  functions  because 
dereferencing  operations  occur  in  parallel 
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It  is  interesting  that  C*  programs  have  been  used  successfully  to  develop  data  parallel 
programs  (xi  a  Glared  meoaory  MIMD  machines  (Sequent),  and  on  a  hyper  cube  (NCUBE 
64).  These  efforts  support  the  contention  diat  data-parallel  programming  has  iq)plicability  to 
a  broad  class  of  architectures. 


Crystal 

Oystal  is  a  functional  language  developed  at  Yale  University  which  supports  "macro" 
data  parallel  programming.  Oystal  assumes  a  macro-parallel  machine  model  with  die 
following  characteristics : 

1.  The  machine  consists  of  processors,  each  with  local  memory,  named  by  a 
coordinate  syston  (an  index  set)  that  will  be  mapped  to  the  processor  space  of  the 
machine  by  the  confer.  Each  processor  can  have  local  variables  and  read  only 
constants. 

2  Processors  can  execute  two  kinds  of  operations:  they  perform  a  cooqiutation  (by 
executing  processing  foncdons)  or  diey  communicate  widi  other  processors  (via 
communications  functions).  Processing  functions  operate  on  local  variaUes.  They 
take  as  arguments,  other  local  varudiles,  constants  and  remote  arguments  passed  in 
frmn  other  processors.  These  remote  arguments  are  gotten  fiom  other  processors 
by  executing  communication  functions.  Remote  argument  values  are  sometimes  the 
jnoduct  of  merging  values  (a  many  to  one  mapping)  from  many  processors. 

3  The  crystal  con^nitational  model  includes  the  notion  of  a  time  srep.  During  a  time 
step  a  processor  can  execure  a  processing  function  widiout  communication  widi 
any  other  processor.  In  other  words  a  time  step  is  a  period  of  parallel  computation 
that  does  not  require  synchronizadtxi. 

4  A  processor,  via  communication,  can  allocate  and  deallocate  groups  of  processors 
to  execute  subroutines. 

Crystal  borrows  constructs  like  sets,  tuples  and  the  notion  of  aggregate  tperators  from 
older  languages  like  SETL  and  APL.  It  is  a  functional  language  that  uses  conventional 
mathematical  notation  to  describes  a  problem  as  a  system  of  possibly  recursive  equations 
over  an  index  set  The  indices  conespond  to  virtual  processors. 

Crystal's  ^proach  to  conpUer  optimizaticxi  takes  advantage  of  the  language's 
equational  structure  to  automate  program  transfonnation.  The  notion  is  to  genoate 
equivalent  programs  from  the  original  that  have  more  efficient  patterns  of  inter-process 
communication  and  data  distribution. 

Optimization  efforts  are  helped  by  die  language's  functional  nature  vdiich  simplifies 
data  dependency  analysis.  Most  inportantly  the  language  has  two  constructs  to  help  the 
programmer  indicate  to  the  compiler  the  data  locality  and  recommended  {xooessor 
organization  for  a  problem's  solution.  These  are  the  "index  domain"  and  the  "data  field". 

An  index  domain  is  a  data  structure  that  describes  the  communication  costs  within  the 
index  set  It  gives  the  index  domain  a  "shape",  and  each  index  a  position  in  a  logical  space. 
The  data  field  is  used  to  describe  distribute  data  structures.  The  compiler  can  optimize  the 
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assignment  of  data  fields  to  index  domains.  When  this  changes  the  index  domain  the 
conq)iler  uses  a  "domain  oKxphism"  to  map  the  old  index  domain  to  the  new,  mne  efficient 
me.  Once  this  is  done,  the  equational  nature  of  the  language  allows  the  confer  to 
transfonn  the  ptogiam  mech^cally  into  an  equivalent  program  that  matches  the  more 
optimal  index  domain. 

Programmer  insight  into  the  nature  of  the  algorithm  is  needed  to  choose  the  most 
efficient  index  domain,  and  index  morphisms.  This  means  that  except  for  a  narrow  set  of 
problems,  a  Crystal  programmer  must  understand  the  parallel  nature  of  die  solution,  and  be 
explicit  with  reflect  to  data  locality  and  inter-processor  communication  as  embodied  by  die 
in^x  domain  and  data  field  constructs. 


DINO 

DINO  (for  Distributed  Numerically  Oriented  Language)  is  an  extended  parallel 
programming  language  similar  to  C*.  Users  declare  a  virtual  parallel  machine  that  is  best 
suited  to  die  algorithm,  as  well  as  a  global  data  structure  and  communication  patterns,  using 
procedures  rurming  concurrently  as  parallel  processors.  The  language  is  C  augmented  widi 
high-level  parallel  oxistructs  that  allow  die  [xogrammer  to  specify  the  data  decomposition 
scheme  and  communication  pattern  that  is  best  for  the  particular  algorithm.  The  DINO 
compiler  transforms  the  program  expressed  in  terms  of  (virtual)  data  parallel  processes  into 
an  efficient  program  whose  number  of  processes  is  die  same  as  the  number  of  available 
physical  processors.  This  reduces  the  overhead  of  many  virtual  processes  and  is  very  useful 
on  MIMD  machines  diat  have  a  small  number  of  physical  processors  and  relatively  slow 
inter-process  communication.  DINO  is  intended  to  be  used  primarily  to  write  data  parallel 
programs  on  MIMD  distributed-memory  machines.  It  enables  massively  parallel  SIMD- 
like  numerical  computation  to  be  programmed  in  a  pseudo-SIMD  fashion.  Its  effectiveness 
depends  on  the  process  contraction  optimization  performed  by  its  compiler. 


Id 


This  is  a  high  level  language  which  supports  fine  grain  parallelism  and  deterministic 
behaviOT.  Id  is  a  data  flow  language  and  its  approach  to  paralleiism  is  quite  different  from 
other  languages  in  allowing  in^licit  parallelism.  Id  assumes  that  the  programmer  should  not 
be  forced  to  identify  and  express  the  parallelism  in  a  problem.  Rather,  the  language's 
operational  semantics  should  allow  die  compiler  to  extract  implicit  parallelism. 


Id  also  supports  die  notitm  that  algorithms  written  in  it  should  be  determinate. 
Typically,  parallel  execution  of  multiple  processes  are  asynchronous  because  of  variations 
in  scheduling  and  processor  speeds.  Determinate  results  require  the  program  to  manage  its 
own  synchronization.  Functit^  languages  automatically  guarantees  deterministic  behavior. 
Id  is  a  functional  language  augmented  with  a  determinate,  parallel,  data  structure  called  an 
’l-stracture".  I-structures  resemble  arrays  and  possess  a  property  called  "non-strictness". 
Non-strictness  increases  the  (^iportunity  for  parallelism  because  elements  in  a  data  structure 
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can  be  read  before  the  all  the  elements  have  been  defined.  This  fits  die  operand  driven 
characteristic  of  data  flow  architectures  because  cqierands  diat  are  data  structure  elements 
can  be  used  as  soon  as  they  are  ready. 

Id  programs  are  conqnled  into  dataflow  graphs,  which  are  the  machine  language  for 
die  MIT  Tagged-Ttdcen  Dataflow  Architecture.  Id  is  very  well  suited  for  programming  this 
kind  of  non- Von  Neumann  target  machine.  Its  non-strict  semantics  (which  are  at  die  heart 
of  its  ability  to  cr^iture  fine  grain  parallelism  inqilicidy)  make  it  difficult  to  partition  code 
into  a  nuixdier  of  sequential  dueads  diat  can  be  run  on  conventional  sequential  and  parallel 
archirectures.  The  tagged-ttdcen  dataflow  architecture  uses  instraction-level  synchronizaticai 
to  do  all  scheduling  of  con^utatioiL  This  tfynamic  scheduling  insures  optimal  use  of 
processor  elements.  Gxnmunication  is  subsumed  in  parameter  transmission. 
SynchrcMiization  is  a  requirement  of  executitxi  of  any  function  or  operation. 

Id  is  based  (xi  the  premise  that  parallel  {xogramming  cannot  be  made  comparable  to 
sequential  programming  by  extoiding  existing  languages.  One  problem  is  diat  the 
extensirais  are  often  architecture-^iecific  and  reflea  fundamental  properties  of  memory 
hierarchies,  interconnection  topologies,  and  non-unifoim  address  spaces.  Another  problem 
is  indetemunacy-the  problem  of  data  races  on  access  to  shared  variables  requires 
complicated  co^g  ai^  debugging  methodologies.  The  program  cannot  be  guaranteed  to 
avoid  time  and  configuration  dqiendent  behavior.  Anodier  problem  is  that  in  many 
applications,  it  is  not  possible  to  maxirmze  use  of  a  hi^y  parallel  machine.  The  problem  of 
qitimmng  partitioning,  scheduling  and  synchronizatiai  activity  is  difficult  even  when  die 
application  can  be  described  deteiministi^y.  The  challenge  of  using  faster  and  more 
massively  parallel  machines  itself  leads  to  attempts  to  solve  more  con^lex  problems  better. 
In  this  environment,  no  single  problem  solving  technique  is  adequate. 

This  is  the  envimunent  where  functional  languages  with  in^licit  parallelism  match  the 
problem  domain  best  A  user's  coding  sQ^le  is  not  affected  by  cmsiderations  of  parallel 
processing  resources.  In  Id,  opportunities  fa  parallelism  arise  fiom  evaluating  die 
arguments  of  a  recursive  functicm  in  parallel,  and  fiom  concurrendy  executing  the  producer 
and  consumer  of  a  data  structure.  This  oxicurrency  never  gives  rise  to  nondetenninacy 
because  there  is  no  "iqidateable”  data  storage  and  hence  no  possibility  data  races.  Id 
allows  exploitation  of  parallelism  in  expressions,  iteratimi,  recursimi  and  allows  a  higher 
degree  and  finer  level  of  parallelism  th^  is  {xactically  obtainable  fiom  odier  approaches.  In 
addition.  Id  omtains  basic  array  structuring  facilities  that  are  new  in  purely  fuiKticxial 
languages,  and  allow  vector-based  parallelism  (the  basis  of  noost  numerical  paraUelism)  to 
be  exploited  fully.  Id  addresses  this  by  a  ccnnbination  of  language  features,  compiling 
technology  and  innovative  tagged  hardware  that  yields  a  powerfiil  unified  solution. 

The  unit  of  synchronization  is  the  I-structure.  These  are  indivisibly  written  once  only, 
and  can  be  accessed  for  reading  only  after  they  have  been  written.  They  are  the  fundamental 
synchronization  mechanism  for  all  congHitatitm.  Additioial  persistent  arrays  called  M- 
structutes  provide  support  for  non-determiiustic  parallel  graph  processing  or 
multiprocessing  and  for  resource  management  To  avoid  idling,  it  is  necessary  for  the 
processor  to  interface  with  data  flow  using  minimum  overhead  and  to  switch  rapidly  to  other 
threads  when  data  is  not  available.  Dataflow  hardware  feanires  instruction  level  forks  and 


joins  to  allow  lai^  numbers  of  threads  to  execute  on  a  single  processor.  Thread-scheduling 
is  data-driven-an  instruction  is  never  atten^ted  until  its  data  are  known  to  be  available. 
Threads  may  be  as  short  as  a  single  instrucdcm  so  that  parallelism  may  be  exploited  at  the 
finest  grain.  All  long-latency  requests  are  "split-phase"  transactitxis:  each  request  carries 
with  it  a  thread  descriptt)r  which  is  returned  to  die  processor  along  with  its  respcmse. 
Threads  may  thus  be  resumed  regardless  of  the  order  of  arrival  of  responses.  I-smictures 
cease  to  exist  when  they  are  no  Icxiger  referaiced;  M-structures  have  ^obal  properties 
associated  with  their  managers. 

Programming  in  Id  involves  a  mixture  of  mathematical  and  logical  styles,  widi  most 
known  programming  language  features  available  is  some  fom.  Aridimetic  operations  and 
expressicms,  as  well  as  basic  ALGOL-tradition  contrcd  structures  are  available.  USP-style 
recursive  function  definitions  based  on  efficioit  list  processing  are  die  fundamental  units  of 
computation.  Id  includes  a  full  library  of  bask:  arithmetic  and  string  manipulation  functions. 
Its  functional  style  depends  fundamentally  on  list  structures  and  basic  list  operations.  It 
includes  array  and  record  structures  with  a  full  set  of  basic  operations.  A  pattern  matching 
notation  is  used  to  do  a  variety  of  list,  array  and  record  selection  operations. 


Unda 

Linda  is  a  high  level  set  of  language  extensions  for  parallel  programming  developed  at 
Yale  UniversiQr.  As  described  by  its  developers  in  a  Communications  of  the  ACM  article: 

"In  the  research  community,  discussion  of  parallel  progrimming  models 
focuses  m^y  on  message-passing,  ccmcurrent  objc^  oriented 
programming,  concurrent  lo^  languages,  and  functional  programming 
systems.  Linda  bears  little  rnemblance  to  any  of  these." 

Linda  involves  the  "explicit  creation  of  parallel  execution  threads"  modeling 

activities  as  a  collection  of  active  and  passive  "niples”.  Active  tiq)les  ate  a  bit  like  processes. 
Passive  tuples  can  be  thought  of  as  data  available  to  anyone  that  wants  it 

There  ate  four  operations  defined  over  "tuple  space";  out  creates  a  passive  tuple  (some 
data).  In  selects  an  existing  tuple  that  matches  a  criteria  and.  removes  it  fi^  "tuple  space". 
Should  the  sought  for  tuple  not  exist  the  "in"  statement  blocks  until  the  data  becomes 
available.  Read  acts  like  "in"  except  it  reads  the  tuple,  but  does  not  remove  it.  Eval  creates  an 
active  tuple  (data  and  some  code  that  acts  on  it).  >^en  the  code  fiiushes  executing,  the 
result  is  a  tww,  passive,  tiq)Ie  left  in  tuple  space. 

Linda  operations  can  be  added  to  any  language,  and  have  been  used  with  FORTRAN, 
C  and  C-H-,  Scheme,  MODULA-2  and  oAers.  Linda  promotes  an  "uncoupled 
programming  style”  that  does  not  require  any  specific  relationships  (client-server,  master- 
slave).  All  the  main  models  of  parallel  programming  have  been  successfully  programmed  in 
IJnda-C,  witii  noticeable  improvements  in  understahdatnlity  and  {»ogram  usalxlity.  The 
Linda  object  manager  is  a  "super  operating  system”  that  performs  dynamic  object 
management  under  concurrency  constraints.  This  is  a  very  complex  interpreter,  that  mns  on 
Sun  and  VAX  workstations. 
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Linda's  authors  fed  that  the  language  is  not  suited  to  SIMD  aichitectures,  iHit  would 
be:  "an  excellent  language  for  massively  parallel,  fine  grained,  asynchronous  machines". 
Tinda  systems  are  commercially  available  for  use  on  homogeneous  networks  of  Sun 
wrakstadons,  shared  memory  MIMD  machine  manufactured  by  Encore  and  Sequent,  as 
well  as  die  distributed  memory  Intel  iPSC/2  hyper  cube. 

T.inda  is  especially  suitable  for  nxxieling  in  a  very  general  environment  diat  does  not 
require  programmer  interaction  with  resource  control.  Modem  distributed  MIMD 
environments  are  probably  the  area  of  its  greatest  potential. 


*IJsp 


*Lisp  is  a  parallel  extension  of  the  Common  Lisp  language,  and  has  the  same  syntax 
and  style  as  Common  Lisp.  ^Lisp  adds  one  major  data  type  to  Common  Lisp:  the  parallel 
variable,  or  pvar.  This  is  an  abstract  data  object  diat  represents  die  concept  of  a  value  stored 
in  die  memory  of  each  processor  on  the  CI^  *Lisp  also  adds  a  large  number  of  fiincticxts 
and  macros  that  r^ierate  exclusively  on  pvars.  Among  these  operatras  are  parallel 
equivalents  for  many  of  the  operators  availabte  in  Common  lisp,  as  weU  as  special-purpose 
operators  that  perform  such  CM-specific  tasks  as  processor  selection,  interprocessor 
communicadon,  and  scanning. 

*Lisp  is  available  in  two  versions:  as  an  interpreter/coixpiler  combinadon  frv  the  CM 
hardware,  and  as  a  stand-alone  simulator.  The  ^Lisp  interpreter  and  conqiiler  are  extensions 
of  the  existing  interpreter  and  coirqiiler  in  Common  Lisp,  and  "'Lisp  programs  are  written 
and  conqiiled  no  differendy  fiom  Comnxm  Lisp  programs.  The  *Lisp  simulator  runs 
entirely  on  the  fix>nt-end  cooqiuter  and  simulates  the  operations  of  an  attached  CM  Code 
developed  using  the  *Lisp  simulator  can  always  be  ported  direcdy  to  the  "'Lisp 
interpreter/compiler  on  the  CM  hardware  widi  few  modifications.  However,  code  conqiiled 
using  the  simulator  must  be  recompiled  to  run  on  the  hardware. 


Molecide 

Molecule  is  a  language  effort  that  involves  defiiung  new  classes  of  program  constructs 
called  "molecules".  A  molecule  belongs  to  one  of  three  levels  of  program  abstraction.  The 
highest  level  uses  data-flow  semantics.  These  molecules  are  used  to  write  applications  in  a 
machine  independent  way.  The  second  level  is  the  "mode"  layer.  Molecules  of  this  class 
corre^xxid  to  the  architectural  characteristic  of  the  target  machine  and  the  kind  of 
synchronization  and  commumcation  facilities  tiiat  are  appropriate  with  that  style  of 
hardware.  For  instance,  molecules  in  the  mode  layer  nti^t  concurrency  primitives 
(send/teceive)  for  use  on  a  discrete  memory  MI>^  message  passing  mact^e.  Each  class 
of  "mode"  molecule  represents  a  kind  of  intermediate  language  that  is  tailored  to  a  particular 
kind  of  parallel  machine.  A  program  tiwsform  tool  called  a  "molecularizer”  converts  the 
application  layer  noolecules  into  a  program  expressed  in  the  molecules  for  the  a  mode. 
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The  lowest  level  of  molecule  is  the  machine  layer,  which  is  the  language  fw  which  a 
compiler  on  the  target  exists.  A  "source  code  to  source  code  pre-compiler"  is  envisioned 
that  would  talre  the  ou^ut  of  the  molecularizer,  and  translate  it  into  the  appropriare  language 
for  the  cooler  automatically. 

The  emphasis  of  this  project  is  to  support  a  programming  methodology  that  allows  a 
programmer  to  write  in  a  dm  flow  language.  The  researcher  feels  that  such  a  data  flow 
language  is  the  most  user  friendly,  and  totally  divorced  him  from  the  architectural 
requirements  of  the  execution  environment  Then  tools  exist  to  perform  transforms  on  that 
language  into  an  intermediate,  and  thai  a  machine  language  suited  to  the  class  of  machine  of 
interest 

The  layered  software  development  iqtproach  outlined  above  has  been  tried 
successfully  oa  a  parallel  matrix  inversitm  proWem  uang  a  data-flow  and  parallel  dialect  of 
PAL  (a  Past^  like  language)  for  the  application  and  naode  layer.  The  machine  layer 
language  is  iPSC  hyper  cube  C  with  message  passing  extensions. 


MtdtiLisp 

Multilisp  is  a  version  of  a  Lisp  dialert  called  Scheme  developed  at  MTT.  Variants  of 
the  language  are  available  on  a  several  parallel  con^uters  including  die  Encore  Multimax, 
the  BEN'S  Butterfly,  and  die  AUiant  FX/8. 

Parallel  versions  of  lisp  typically  support  concurrency  by  adding  an  extension  called 
a  "future"  to  the  base  language.  The  form,  (future  X),  immediatoly  returns  a  future  for  X 
plus  creates  a  task  to  evaluate  X.  The  future  of  X  is  can  be  thought  of  as  a  place  holder  for 
the  evaluated  value  of  X  It  can  be  used  until  the  value  beconnes  available.  The  future  should 
eventually  resolve  to  a  value  when  the  child  task  spawned  to  evaluate  X  completes  its  work. 

If  a  task  "T'  tries  to  use  a  future  of  X  when  it,  in  fact,  needs  the  resolved  value,  'T' 
will  suspend  until  the  child  process  is  through  confuting  it  For  some  operatim^s,  like  the 
movement  of  a  value  from  one  location  to  anodier,  as  in  an  assignment  statement,  or  the 
passing  of  a  parameter,  the  future  can  be  used  while  the  value  is  being  resolved.  This  allows 
a  style  of  computation  similar  to  dataflow. 

The  future  can  be  used  to  perform  data  parallel  operations.  For  instance,  the  MultiLisp 
library  function  pmapear  takes  the  future  of  a  function  for  every  element  in  a  list  This 
results  in  the  same  function  being  evaluated  in  parallel  for  all  elements  in  the  list. 

(^tics  of  MultiLisp  contend  that  the  language  is  not  transparent:  it  is  difficult  to  know 
when  to  use  the  future  ctmstruct  or  how  much  it  will  cost 


Occam 


Occam  is  based  on  the  Communicating  Sequential  Process  (CSP)  construct  developed 
by  Hoare.  An  Occam  program  consists  of  active  agents  called  "processes"  that 
communicate  with  each  other  via  one  way,  point-to-poini  links  between  processes  called 
"charuiels".  There  are  three  kinds  of  processes:  assignment,  input  and  ouqrut  Assignment 
processes  change  the  value  of  variables.  Input  processes  read  fiom  channels,  while  an 
ouqrut  process  writes  to  them.  Such  communication  is  synchronized,  i.e.  the  inputting 
process  will  wait  for  another  process  to  ouqjut  to  the  channel. 

These  elemental  processes  are  like  statements  in  a  conventional  language.  They  can  be 
ccnnbined  into  a  series  oH  actions  that  occur  sequentially  witii  a  "seq"  statement  Or  they  can 
be  forced  to  occur  in  parallel  with  a  "par"  statement  Such  compilations  are  themselves 
processes,  and,  as  such  can  be  used  in  a  par  seq  statement  A  par  process  blocks  until  all 
the  processes  within  it  con^lete. 

Occam  is  strongly  typed  and  the  scope  of  objects  is  limited  to  the  process  immediately 
following  the  declaration.  Processes  can  only  communicate  with  channels.  Occam  does  not 
support  shared  variables. 

Occam  also  supports  an  iterative  construct  "while"  and  two  contrP  flow  constructs: 
"if*  and  "alt".  The  "iT  is  similar  to  "if*  statemmts  in  ctxiventional  languages.  The  "alt" 
statement  selects  (me  of  a  group  of  alternatives  depending  (m  the  state  of  a  channel. 

Programming  style  in  Occam  connects  large  processes  via  a  channel.  The  data  is  read 
from  a  channel,  manipulated  by  inserting  pto(%sses  afra  the  receipt  of  the  data,  and  output 
to  a  different  chaiuieL  It  was  develc^ied  to  program  the  INMOS  transputer  and  as  such,  is  a 
gO(xl  match  frx’  MIMD  architecture  with  discrete  memory  where  die  communicaticm  costs 
to  the  adjacent  processors  is  cheap. 

Paralation  Model 

The  Paralation  model  is  qiecificaticm  for  language  extensicms  consisting  of  a  new  data 
structure  —  (a  "paralation")  and  new  operations  ("elementwise  evaluatitm",  "mapping"  and 
"match").  They  comprise  a  parallel  programming  model  that  "can  be  combined  with  any 
base  language  to  produce  a  concrete  parallel  language."  The  resulting  ccxle  is  easy  to 
conqiile  efficiently  because  the  constructs  ccmtain  explittit  ncm-machine  dependent 
informati(m  about  data  locality,  impropriate  processor  tc^ology,  and  synchronization 
requirements.  It  is  machine  independent  because  compilers  can  choose  to  use  or  ignore  this 
informaticm  depending  on  die  machine  organizatitm. 

Thoe  is  a  versi(m  of  Paralaticm  lisp  based  cm  comnxm  lisp  that  runs  cm  the 
Ckmnecticm  Machine.  A  versicm  of  Paralation  C  is  under  developtncnt 

A  "paralati(m"  is  conmosed  of  "fields".  A  Held  is  composed  of  elemoits,  which  are 
pieces  of  data  that  are  all  the  same  size.  Every  paralation  has  one  field  that  contains  its 
indices.  An  index  numbers  can  be  thought  of  as  specifying  a  "site"  or  processor. 
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A  rough  analogy  to  a  paralation  is  a  table  where  each  column  is  a  field.  Each  row  has 
an  index  number  that  specifies  the  "site"  or  processor  that  the  row  specified  by  that  index 
number  is  processed  on. 

The  model  guarantees  locality  fcv  data  used  at  a  processing  site,  and  for  all  the 
processing  sites  within  a  paralation.  This  means  all  the  elements  in  a  row  will  be  smred 
close  together  in  memory.  Addititxially,  all  the  processors  associated  with  a  particular 
paralation  are  logically  "close"  to  each  other.  TlU  distance  between  sites  within  a  paralation 
is  determined  by  the  "shape"  of  the  paralation.  The  shape  is  under  programmer  control.  For 
example,  if  a  paralation  were  sh2^)ed  like  a  ring  and  had  3  sites,  the  shape  captures  the  faa 
that  site  1  is  logically  1  site  away  from  bodi  site  2  and  site  3. 

The  Paralation  model  supptxts  three  operations:  "Elementwise  evaluation",  "mapping” 
and  "move".  Elementwise  evaluation  is  an  operation  diat  acts  on  a  field  of  the  paralation.  It 
perfomos  the  same  operation  on  each  element  and  creates  a  new  field  containing  the  result 
Since  each  element  in  a  field  resides  at  a  differmt  site,  operations  occur  in  parallel.  The 
elementwise  evaluation  is  coaplete  when  aU  the  sites  finish  executing  this  code. 

Elementwise  evaluation  is  the  (xily  construct  for  synchronization  supported  between  sitts. 
Any  code  that  requites  synchronization  within  die  elementwise  evaluation  is  considered  an 
errw. 

Data  is  moved  between  paralations  by  the  "move"  operator.  Elements  to  be  moved  are 
determined  by  a  "mapping".  A  mapping  is  created  by  a  "match”  operation.  Match 
tolerations  allow  the  programmer  to  identify  elements  within  one  paralation  that  are  to  be 
associated  with  elements  in  a  different  paralation.  The  "move"  can  be  thought  of  as  a  kind  of 
parallel  assignment  statement  which  over-writes  a  field  in  a  paralation  with  data  from 
anotho*  parsdation. 

The  Paralation  model  is  compatible  with  any  parallel  architecture,  but  the  author 
crxitends  that  "a  crossbar-based  shared  memory  machine  is  not  the  ideal  target".  This  is 
because  the  shared  memory  implementation  "ignores  the  locality  properties  of  paralations 
and  fields."  The  author  goes  on  to  state  that  "diis  is  inherent  in  the  idu  of  shared  memory 
hardware,  which  hides  the  cost  of  coinmunicati(»."  In  other  words,  information  about  the 
locality  of  a  paralation  is  of  no  use  if  the  cost  of  communicatimi  from  one  processor  to 
another  is  always  the  same.  This  model  is  best  suited  to  medium  or  fine  grained  MIMD  (x- 
SIMD  machines. 

PCS 

PCN  (Program  Composition  Notation)  is  an  iii^)erative  block-structured  expression 
language  with  guarded  commands  for  parallelism.  It  attracts  to  combine  the  functional 
style  of  Lisp  for  structuring  blocks  and  stattment  segments,  with  a  concise  style  similar  to  C 
for  composing  statement  and  expression  sequences.  The  declaration  and  block  structure  is 
intended  to  map  to  data  flow  graphs,  and  the  composed  segments  to  the  actions  of  iKxles  in 
diis  grtqjh.  Relating  segments  to  nodes  allows  noonitoring  and  tracing  to  be  done  according 
to  the  structure  specified  by  the  program,  and  gives  a  programmable  way  to  help  make 
execution  sequences  meaningful  for  dynamic  and  post  mortem  analysis.  Two  tools  Gauge 


and  Upshot  are  part  of  the  interpretation  systeni  intended  to  this  nwnitoring  facility  to 
provide  execution  profiles  (e.g.,  to  improve  load  balancing)  and  collect  trace  information  to 
assist  in  performance  monitoring  and  tuning.  Upshot  can  be  used  to  do  complicated  event 
tracing  for  post  nooitem  analysis,  debugging,  and  even  testing  in  the  presence  of  potential 
race  ccmditions  or  deadlock. 

The  PCN  system  is  a  combination  of  a  language,  interpreter  and  lilsary.  It  supports 
hi^-level  programming  and  debugging  as  well  as  the  special  kinds  of  performance 
assessment  already  highlighted.  It  allows  interface  to  FORTRAN  and  C  programs  via 
interface  directives,  and  is  intended  to  facilitate  the  connection  of  comtrxxily  used  sequential 
segments  using  some  simple  notions  to  develop  a  parallel  program: 

•  defiiutional  variables,  for  controlling  ti»  exchange  of  information  between 
concurrently  executing  potentially  interfering  program  segments 

•  concurrent  composition,  for  building  parallel  programs  by  composing  segments 
into  systems  having  a  number  of  independently  executing  threads 

•  controlled  nondeterministic  choice,  for  expressing  intrinsic  parallelism  of  program 
segments,  following  Djistra's  notion  of  guarded  commands  where  the  gut^  need 
not  be  mutually  exclusive 

•  encapsuiatioxi  of  state  change,  by  restricting  use  of  data  structures  subject  to  state 
change  to  program  segments  where  they  are  not  shared 

This  composition  strategy  is  very  similar  to  tire  notions  of  CODE,  except  tirat 
hierarchies  of  sequential  segments  can  be  buUt  and  synchronization  of  shared  data  is  based 
(m  a  distributed  rather  than  stractured  global  data  basis.  This  is  generally  more  efficient,  and 
certainly  more  programmable.  PCN  is  oriented  towards  making  parallel  system 
programming  a  practical  reality,  and  supporting  monitoring  and  tuning  in  a  style  similar  to 
cunent  sequential  develqrment  methods. 


Proteus 

This  is  a  wide-spectrum  language  being  developed  at  Duke  with  a  range  of  parallel 
ccmtrol  constructs,  intended  to  provide  a  complete  set  of  possible  notations  for 
programming  parallel  algorithms.  It  supports  a  set  of  synchronization  primitives  intended  to 
allow  mutual  exclusion  on  concurrent  objects,  with  mechanisms  provided  to  map  the  control 
primitives  to  MIMD,  SIMD  or  specialize  architectures.  It  supports  specification  of  process 
sequencing  and  time  constraints  to  allow  flexible  target  independent  real-time  programming. 
It  coaq)etes  with  languages  such  as  Crystal  which  encourage  refinement,  but  supports  mcxe 
specific  and  constrained  concurrency  implementation. 


RAPIDE 

Developed  at  Stanford  University,  RAPIDE  is  a  process-structured  expression 
language  with  guarded  commands  for  parallelisoL  It  atten^ts  to  combine  tiie  functional 
style  of  Lisp  for  structuring  blocks  and  statement  segments,  with  a  process-oriented 
structure  similar  to  VHDL  for  composing  statement  and  expression  sequences.  The  process 
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structure  is  intended  to  map  to  data  flow  graphs,  and  the  composed  segments  to  the  events 
associated  with  nodes  in  dds  graph.  Relating  events  to  nodes  allows  monitoring  and  tracing 
to  be  done  according  to  the  structure  specified  by  the  program,  and  gives  a  programmable 
way  to  help  make  execution  sequences  tneaiun^ul  for  dynamic  analysis  and  verification. 


Strand~88 

Strand  is  a  single  assignment  language,  with  a  functional  programming  style, 
supporting  six  programmable  communication  methods  that  model  various  kinds  of  parallel 
architecture.  The  fundamental  notion  is  communicating  processes,  which  syiKhtoiuze  using 
data  flow  ctmstraints  according  to  data  availatnlity.  Strand  is  a  parallel  programming  tool 
that  includes  a  development  environment  and  parcel  programming  libiaries.  Strand 
ccm^utes  via  reduction  of  a  pool  of  extremely  li^t  weight  processes  defined  by  evaluating 
guarded-Hora  clauses.  Processes  commimicate  by  reading  and  writing  shared  variables 
(commuiucation  chaimels)  which  can  only  be  defiited  once.  A  computation  step  involves 
removing  a  process  from  a  pool  If  the  process  is  predefined,  then  it  is  executed 
immediately;  otherwise  a  reduction  attend  is  made.  This  involves  data  flow 
synchronizatiotL  The  process  must  wait  until  its  constraints  are  satisfied;  it  then  changes 
state  and  forks  new  goal  processes,  which  are  added  into  the  process  pool.  This  process 
ctxitinues  until  tiie  pool  is  empty.  Gxnputation  is  then  conplete. 

The  programming  system  supports  a  variety  of  virtual  machines  and  load  balancing 
tools  to  ease  the  programming  tasks  and  allow  applications  to  scale  when  additional 
hardware  becomes  available.  It  is  transportable  across  different  MIMD  shared  or  distributed 
memory  machines.  To  enable  the  reuse  of  existing  code.  Strand  supports  a  multi-lingual 
mechanism  which  consists  of:  (1)  a  logic  programming  approach  that  allows  users  to 
specify  parallel  program  constructs  and  synchronizatitm  in  a  simple  way,  aixl  (2)  a  foreign 
interface  which  allows  a  program  construct  to  be  implemented  in  existing  sequential  C  and 
FORTRAN  or  by  specialized  hardware. 


Developed  at  UCXA,  UC  is  based  on  the  UNITY  noodel  of  parallel  programs. 
Develc^rment  of  UC  was  driven  by  a  desire  to  separate  efficiency  concerns  from 
programming  issues  when  writing  a  parallel  foogram.  It  is  easier  to  develop  and  maintain  a 
program  if  the  language  presents  a  single  virtual  machine  noodel  to  the  programmer.  On  the 
other  hand,  execution  costs  are  optimizi^  by  taking  advantage  of  ^)ecific  architectural 
features  of  a  parallel  processor  such  as  tiw  inter  connection  topology.  Ifowever,  with  most 
parallel  programming  languages  the  modification  of  the  code  to  exploit  a  particular 
architectural  feature  changes  the  mearung  of  the  program  perhaps  causing  a  bug,  or  possibly 
obscuring  the  algorithm  making  the  program  hard  to  maintain. 

UC  is  an  extension  of  C  with  a  few  high  level  parallel  constructs  allowing  the 
algorithm  to  be  expressed  in  an  abstract,  high-level  language.  It  adds  one  data  type  - 


in(lex_set,  one  q}erator  -  reduction,  and  four  constructs  to  express  dependencies  among 
statements  -  par,  oneof,  seq  and  solve.  It  also  disallows  goto  statements  and  only  allows  C 
pointa^  to  be  used  K>  pass  an  array  (or  array  slice)  as  an  argument  to  a  function. 

An  index_set  represents  an  ordered  set  of  integers,  each  of  which  must  be  a  constant 
expression.  The  reduction  operator  is  used  to  perform  a  binary  operation  on  a  set  of 
(q>erands.  The  par  construct  indicates  that  the  following  statements  may  be  executed  in 
parallel,  while  the  seq  construct  indicates  the  statements  must  execute  sequentially.  The 
solve  constract  may  used  to  write  programs  that  solve  a  set  of  prq>er  equations.  The 
(xieof  ccxistruct  represents  a  non-deteiministic  choice.  Initially  the  compiler  generates  a 
default  mapping  of  the  program  data-space  to  the  target  architecture  that  allows  a 
programmer  to  develop  a  correct  prototype. 

A  separate  map  section  of  die  program  can  be  specified  tt>  override  the  default  data 
mappings  of  the  compiler.  The  data  nupping  can  be  improved  to  better  match  the 
underiying  architecture  without  changing  the  cotrecmess  of  the  progranL  As  a  result  the 
program  is  easier  to  develop  and  maintain  while  execution  remains  competitive  with  other 
languages. 


UNITY 

UNITY  is  tile  specification  language  used  by  Chandy  and  Misra  in  their  1988  book 
on  parallel  program  design.  It  was  intended  as  a  way  to  unify  parallel  programming  and 
ver^cation  in  a  single  algorithm  definiti(»  and  proof  system.  It  im>vides  a  means  of 
expressing  algorithms  precisely  in  an  atchimcture  indqiendent  manner  and  provides  a 
methodology  for  using  architecture  dependent  rdinements  to  produce  proofs  of  program 
completioiL  This  is  the  background  for  a  formal  system  for  verifying  parallel  programs. 


-92- 


UISSION 

OF 

ROME  LABORATORY 

Rome  Laboratory  plans  and  executes  an  interdisciplinary  program  in  re¬ 
search,  development,  test,  and  technology  transition  in  support  of  Air 

3 

Force  Command,  Control,  Communications  and  Intelligence  (C  I)  activities 
for  all  Air  Force  platforms.  It  also  executes  selected  acquisition  programs 
in  several  areas  of  expertise.  Technical  and  engineering  support  within 
areas  of  competence  is  provided  to  BSD  Program  Offices  (POs)  and  other 
ESD  elements  to  perform  effective  acquisition  of  C  I  systems.  In  addition, 
Rome  Laboratory's  technology  supports  other  AFSC  Product  Divisions,  the 
Air  Force  user  community,  and  other  DOD  and  non-DOD  agencies.  Rome 
Laboratory  maintains  technical  competence  and  research  programs  in  areas 
including,  but  not  limited  to,  communications,  command  and  control,  battle 
management,  intelligence  information  processing,  computational  sciences 
and  software  producibility,  wide  area  surveillance/sensors,  signal  proces¬ 
sing,  solid  state  sciences,  photonics,  electromagnetic  technology,  super¬ 
conductivity,  and  electronic  reliability/maintainability  and  testability. 


